PySpark: Retrieving Unique Elements from Two Arrays

PySpark @ Freshers.in

Let’s start by creating a DataFrame named freshers_in. We’ll make it contain two array columns named ‘array1’ and ‘array2’, filled with hard-coded values.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

data = [(["java", "c++", "python"], ["python", "java", "scala"]),
        (["javascript", "c#", "java"], ["java", "javascript", "php"]),
        (["ruby", "php", "c++"], ["c++", "ruby", "perl"])]

# Create DataFrame
freshers_in = spark.createDataFrame(data, ["array1", "array2"])
freshers_in.show(truncate=False)

Calling show(truncate=False) prints the full contents of freshers_in:

+----------------------+-----------------------+
|array1                |array2                 |
+----------------------+-----------------------+
|[java, c++, python]   |[python, java, scala]  |
|[javascript, c#, java]|[java, javascript, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |
+----------------------+-----------------------+

To create a new array column containing the unique elements of ‘array1’ and ‘array2’, merge the two arrays with concat() and then remove the duplicates with array_distinct(). Both functions support array columns from Spark 2.4 onward.

from pyspark.sql.functions import array_distinct, concat

# Add 'unique_elements' column
freshers_in = freshers_in.withColumn("unique_elements", array_distinct(concat("array1", "array2")))
freshers_in.show(truncate=False)

Result

+----------------------+-----------------------+---------------------------+
|array1                |array2                 |unique_elements            |
+----------------------+-----------------------+---------------------------+
|[java, c++, python]   |[python, java, scala]  |[java, c++, python, scala] |
|[javascript, c#, java]|[java, javascript, php]|[javascript, c#, java, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |[ruby, php, c++, perl]     |
+----------------------+-----------------------+---------------------------+
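
As a sanity check, each row of this result can be reproduced by hand. The sketch below is a plain-Python model of "merge, then keep the first occurrence of each element", not Spark code; the helper name unique_elements is hypothetical, and it assumes the first-occurrence ordering seen in the output above.

```python
def unique_elements(array1, array2):
    # Concatenate the two lists, then drop repeated elements.
    # dict.fromkeys keeps the first occurrence and preserves insertion order.
    return list(dict.fromkeys(array1 + array2))

rows = [
    (["java", "c++", "python"], ["python", "java", "scala"]),
    (["javascript", "c#", "java"], ["java", "javascript", "php"]),
    (["ruby", "php", "c++"], ["c++", "ruby", "perl"]),
]

expected_column = [unique_elements(a, b) for a, b in rows]
for row in expected_column:
    print(row)
```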

The unique_elements column contains every distinct element that appears in ‘array1’ or ‘array2’, with each element listed only once.

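Note that array_distinct() treats NULL as an ordinary array element: if either array contains NULLs, a single NULL remains in the result. If your arrays could contain NULLs and you want to exclude them, filter them out of the arrays before applying concat() and array_distinct(). Also be aware that concat() returns NULL for the whole row if either array column is itself NULL.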

Author: user
