PySpark : Returning an Array that Contains the Matching Elements in Two Input Arrays


This article will focus on a particular use case: returning an array that contains the matching elements in two input arrays in PySpark. To illustrate this, we’ll use PySpark’s built-in functions and DataFrame transformations.

PySpark provides a built-in function, array_intersect, that compares two array columns and returns their common elements directly. The same result can also be assembled by hand with functions such as explode and collect_list, which is useful to know when you need more control over the comparison.

Let’s assume we have a DataFrame that has two columns, both of which contain arrays:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("1", ["apple", "banana", "cherry"], ["banana", "cherry", "date"]),
    ("2", ["pear", "mango", "peach"], ["mango", "peach", "lemon"]),
]
df = spark.createDataFrame(data, ["id", "Array1", "Array2"])

This creates a DataFrame with an id column and two array columns per row.

To return an array with the matching elements in ‘Array1’ and ‘Array2’, use the array_intersect function:

from pyspark.sql.functions import array_intersect

df_with_matching_elements = df.withColumn(
    "MatchingElements", array_intersect(df.Array1, df.Array2)
)

The ‘MatchingElements’ column will contain the matching elements in ‘Array1’ and ‘Array2’ for each row.

Using the PySpark array_intersect function, you can efficiently find matching elements in two arrays. This function is not only simple and efficient but also scalable, making it a great tool for processing and analyzing big data with PySpark. It’s important to remember, however, that this approach works on a row-by-row basis. If you want to find matches across all rows in the DataFrame, you’ll need to apply a different technique.

|id |Array1                 |Array2                |MatchingElements|
|---|-----------------------|----------------------|----------------|
|1  |[apple, banana, cherry]|[banana, cherry, date]|[banana, cherry]|
|2  |[pear, mango, peach]   |[mango, peach, lemon] |[mango, peach]  |
