PySpark : Combine the elements of two or more arrays in a DataFrame column


pyspark.sql.functions.array_union

The array_union function is a PySpark function that combines the elements of two array columns in a DataFrame. It takes exactly two arguments, each a column name or Column expression holding arrays of the same element type, and returns a new Column (not a new DataFrame) that contains the union of the elements of the two inputs, without duplicates.

Syntax

pyspark.sql.functions.array_union(col1, col2)

Here’s an example of how to use the array_union function:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_union
# Create a SparkSession
spark = SparkSession.builder.appName("PySparkArrayUnion").getOrCreate()
# Create a DataFrame with sample data
data = [("Jamaica Wills", [1, 2, 3], [2, 3, 4]), ("Tim Bob", [3, 4, 5],[4, 5, 6]), ("Mike Walters", [5, 6, 7],[6,7,8])]
df = spark.createDataFrame(data, ["name", "numbers", "numbers2"])
# Union the elements of the "numbers" and "numbers2" columns using array_union
union_df = df.select("name", array_union("numbers","numbers2"))
# Show the union DataFrame
union_df.show()

Result

+-------------+------------------------------+
|         name|array_union(numbers, numbers2)|
+-------------+------------------------------+
|Jamaica Wills|                  [1, 2, 3, 4]|
|      Tim Bob|                  [3, 4, 5, 6]|
| Mike Walters|                  [5, 6, 7, 8]|
+-------------+------------------------------+

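array_union accepts exactly two columns. To combine three or more array columns, nest the calls, for example array_union(array_union("numbers", "numbers2"), "numbers3"), where "numbers3" is a hypothetical third array column.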

In the result above, the new column contains the union of the elements of the two input columns, with duplicate elements removed. This function is particularly useful when you want to merge several array columns into one, which makes the data in the DataFrame easier to query and analyze.
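To make the per-row behavior concrete, here is a plain-Python model of what the union computes for one row. It is only an illustration, not Spark's implementation: the helper name array_union_model is hypothetical, the first-occurrence ordering is an assumption based on the output shown above, and returning None when either input is None is also an assumption about null handling.

```python
def array_union_model(a, b):
    """Plain-Python model of array_union for a single row (illustrative only)."""
    # Assumption: a null input array yields a null result
    if a is None or b is None:
        return None
    seen = set()
    result = []
    # Walk the first array, then the second, keeping each value once
    for item in list(a) + list(b):
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(array_union_model([1, 2, 3], [2, 3, 4]))  # [1, 2, 3, 4]
print(array_union_model([1, 1, 2], [2, 3]))     # [1, 2, 3]
```

The second call shows that duplicates inside a single input array are dropped as well, not only values shared between the two arrays.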

Author: user
