PySpark : Retrieving Unique Elements from two arrays in PySpark

user July 4, 2023

This post shows how to combine two array columns in PySpark into a single array of unique elements. Let’s start by creating a DataFrame named freshers_in with two array columns, ‘array1’ and ‘array2’, filled with hard-coded values.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

data = [(["java", "c++", "python"], ["python", "java", "scala"]),
        (["javascript", "c#", "java"], ["java", "javascript", "php"]),
        (["ruby", "php", "c++"], ["c++", "ruby", "perl"])]

# Create DataFrame
freshers_in = spark.createDataFrame(data, ["array1", "array2"])
freshers_in.show(truncate=False)

Calling show(truncate=False) prints the full contents of freshers_in without shortening long values:

+----------------------+-----------------------+
|array1                |array2                 |
+----------------------+-----------------------+
|[java, c++, python]   |[python, java, scala]  |
|[javascript, c#, java]|[java, javascript, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |
+----------------------+-----------------------+

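To create a new array column containing the unique elements from ‘array1’ and ‘array2’, we first merge the two arrays with concat(), which appends the elements of ‘array2’ to those of ‘array1’, and then pass the merged array to array_distinct(), which removes the duplicates. array_distinct() was added in Spark 2.4, so this approach needs that version or later.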

from pyspark.sql.functions import array_distinct, concat
# Add 'unique_elements' column
freshers_in = freshers_in.withColumn("unique_elements", array_distinct(concat("array1", "array2")))
freshers_in.show(truncate=False)

Result

+----------------------+-----------------------+---------------------------+
|array1                |array2                 |unique_elements            |
+----------------------+-----------------------+---------------------------+
|[java, c++, python]   |[python, java, scala]  |[java, c++, python, scala] |
|[javascript, c#, java]|[java, javascript, php]|[javascript, c#, java, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |[ruby, php, c++, perl]     |
+----------------------+-----------------------+---------------------------+

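The unique_elements column holds each distinct element from ‘array1’ and ‘array2’ exactly once per row. As the output shows, elements appear in the order they are first encountered: those from ‘array1’ first, followed by any new ones from ‘array2’.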

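Note that PySpark’s array functions treat NULLs as valid array elements: if either array contains NULLs, array_distinct() keeps one NULL in the result. If your arrays could contain NULLs and you want to exclude them, filter them out of the arrays as part of the array_distinct and concat operations. Also be aware that when an entire array column is NULL for a row (rather than containing a NULL element), concat() returns NULL for that row.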


Author: user
