PySpark: Dropping duplicate rows in PySpark – a comprehensive guide with examples

PySpark @ Freshers.in

PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will walk through the steps to drop duplicate rows in PySpark.

First, let’s create a sample DataFrame with 5 columns, using the createDataFrame() method of the SparkSession object.

from pyspark.sql import SparkSession
from pyspark.sql.functions import first
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("drop-duplicates").getOrCreate()
# Create a sample dataframe with 5 columns
data = [("Sachin", 25, "M", "New York", 20000),
        ("Sharry", 30, "F", "San Francisco", 30000),
        ("Mandid", 35, "M", "Los Angeles", 40000),
        ("Sachin", 25, "M", "New York", 20000),
        ("Sharry", 30, "F", "San Francisco", 30000)]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
df = spark.createDataFrame(data, schema=schema)
df.show()

Output

+------+---+------+-------------+------+
|  Name|Age|Gender|         City|Salary|
+------+---+------+-------------+------+
|Sachin| 25|     M|     New York| 20000|
|Sharry| 30|     F|San Francisco| 30000|
|Mandid| 35|     M|  Los Angeles| 40000|
|Sachin| 25|     M|     New York| 20000|
|Sharry| 30|     F|San Francisco| 30000|
+------+---+------+-------------+------+

As you can see, the DataFrame contains duplicate rows. Now, let’s drop them.

Method 1: Using dropDuplicates()

The simplest way to drop duplicate rows in PySpark is the dropDuplicates() method. It returns a new DataFrame with the duplicate rows removed.

# Drop duplicate rows using dropDuplicates() method
df = df.dropDuplicates()
df.show()

Output: the duplicate rows have been removed.

+------+---+------+-------------+------+
|  Name|Age|Gender|         City|Salary|
+------+---+------+-------------+------+
|Sachin| 25|     M|     New York| 20000|
|Mandid| 35|     M|  Los Angeles| 40000|
|Sharry| 30|     F|San Francisco| 30000|
+------+---+------+-------------+------+

Method 2: Using groupBy() and agg() functions

Another way to drop duplicate rows is to combine the groupBy() and agg() functions. This method groups the DataFrame by all of its columns and then aggregates each column with a function such as first() or last(), so only one row is retained for each combination of column values.

# Drop duplicate rows using groupBy() and agg() functions
df = df.groupBy(df.columns)\
    .agg(*[first(c) for c in df.columns])
df.show()

The extra first(...) columns are kept here for illustration; you can drop them if they are not required.

+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
|  Name|Age|Gender|         City|Salary|first(Name)|first(Age)|first(Gender)|  first(City)|first(Salary)|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
|Sharry| 30|     F|San Francisco| 30000|     Sharry|        30|            F|San Francisco|        30000|
|Mandid| 35|     M|  Los Angeles| 40000|     Mandid|        35|            M|  Los Angeles|        40000|
|Sachin| 25|     M|     New York| 20000|     Sachin|        25|            M|     New York|        20000|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+

As you can see, the duplicate rows have been removed and only one row is retained for each combination of column values.

Important Spark URLs to refer to:

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page