PySpark: Dropping duplicate rows in PySpark – a comprehensive guide with examples

PySpark @ Freshers.in

PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will walk through the steps to drop duplicate rows in PySpark.

First, let’s create a sample DataFrame with 5 columns, using the createDataFrame() method of the SparkSession object.

from pyspark.sql import SparkSession
from pyspark.sql.functions import first
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("drop-duplicates").getOrCreate()
# Create a sample dataframe with 5 columns
data = [("Sachin", 25, "M", "New York", 20000),
        ("Sharry", 30, "F", "San Francisco", 30000),
        ("Mandid", 35, "M", "Los Angeles", 40000),
        ("Sachin", 25, "M", "New York", 20000),
        ("Sharry", 30, "F", "San Francisco", 30000)]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
df = spark.createDataFrame(data, schema=schema)
df.show()

Output

+------+---+------+-------------+------+
|  Name|Age|Gender|         City|Salary|
+------+---+------+-------------+------+
|Sachin| 25|     M|     New York| 20000|
|Sharry| 30|     F|San Francisco| 30000|
|Mandid| 35|     M|  Los Angeles| 40000|
|Sachin| 25|     M|     New York| 20000|
|Sharry| 30|     F|San Francisco| 30000|
+------+---+------+-------------+------+

As you can see, the DataFrame contains duplicate rows. Now, let’s drop them.

Method 1: Using dropDuplicates()

The simplest way to drop duplicate rows in PySpark is the dropDuplicates() method. It returns a new DataFrame with the duplicate rows removed.

# Drop duplicate rows using dropDuplicates() method
df = df.dropDuplicates()
df.show()

Output: the duplicate rows have been removed.

+------+---+------+-------------+------+
|  Name|Age|Gender|         City|Salary|
+------+---+------+-------------+------+
|Sachin| 25|     M|     New York| 20000|
|Mandid| 35|     M|  Los Angeles| 40000|
|Sharry| 30|     F|San Francisco| 30000|
+------+---+------+-------------+------+

Method 2: Using groupBy() and agg() functions

Another way to drop duplicate rows is to combine the groupBy() and agg() functions. This method groups the DataFrame by all of its columns and then aggregates each column with a function such as first() or last(), so only one row is retained for each combination of column values.

# Drop duplicate rows using groupBy() and agg() functions
df = df.groupBy(df.columns)\
    .agg(*[first(c) for c in df.columns])
df.show()

The extra first(...) columns are kept here for illustration; you can drop them if they are not required.

+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
|  Name|Age|Gender|         City|Salary|first(Name)|first(Age)|first(Gender)|  first(City)|first(Salary)|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
|Sharry| 30|     F|San Francisco| 30000|     Sharry|        30|            F|San Francisco|        30000|
|Mandid| 35|     M|  Los Angeles| 40000|     Mandid|        35|            M|  Los Angeles|        40000|
|Sachin| 25|     M|     New York| 20000|     Sachin|        25|            M|     New York|        20000|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+

As you can see, the duplicate rows have been removed and only one row is retained for each combination of column values.

Important Spark URLs to refer to:

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page