pyspark.sql.functions.arrays_overlap
The arrays_overlap function is a PySpark function that checks whether two array columns share at least one element. It takes exactly two arguments, each a Column or a column name, and returns a Column expression. When that expression is used in select or withColumn, the resulting DataFrame gets a boolean column indicating, row by row, whether the two arrays have any common elements.
Here’s an example of how to use the arrays_overlap function:
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_overlap
# Create a SparkSession
spark = SparkSession.builder.appName("PySparkArrayOverlap").getOrCreate()
# Create a DataFrame with sample data
data = [
    ("Jamaica Wills", [1, 2, 3], [2, 3, 4]),
    ("Tim Bob", [3, 4, 5], [4, 5, 6]),
    ("Mike Walters", [5, 6, 7], [6, 7, 8]),
]
df = spark.createDataFrame(data, ["name", "numbers", "numbers2"])
# Check if the elements of the "numbers" and "numbers2" columns overlap using arrays_overlap
overlap_df = df.select("name", arrays_overlap("numbers","numbers2"))
# Show the overlap DataFrame
overlap_df.show()
In this example, we create a DataFrame with three columns: “name”, “numbers” and “numbers2”. Both “numbers” and “numbers2” are arrays of integers. We then pass these two columns to arrays_overlap, which returns a Column expression rather than a DataFrame. The select call evaluates that expression for every row and returns a new DataFrame with a boolean column, which we assign to the variable overlap_df. The original DataFrame remains unchanged.
Result
+-------------+---------------------------------+
|         name|arrays_overlap(numbers, numbers2)|
+-------------+---------------------------------+
|Jamaica Wills|                             true|
|      Tim Bob|                             true|
| Mike Walters|                             true|
+-------------+---------------------------------+
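As you can see, the new column holds true for every row, because each pair of arrays shares at least one element. In general, the result is true when the two arrays have a common non-null element and false otherwise. There is one exception: if no common element is found, both arrays are non-empty, and either of them contains a null element, the result is null rather than false.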
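Null elements are the one case where the result is neither true nor false: when no common element is found, both arrays are non-empty, and either contains a null, Spark returns null. The plain-Python function below is only a rough model of that documented behaviour, not Spark's implementation. The name overlap_model is hypothetical, and the sketch assumes the arrays themselves are not null.

```python
from typing import Optional, Sequence


def overlap_model(
    left: Sequence[Optional[int]], right: Sequence[Optional[int]]
) -> Optional[bool]:
    """Hypothetical model of the documented arrays_overlap result for one row."""
    # Only non-null elements can count as a match.
    left_values = {x for x in left if x is not None}
    right_values = {x for x in right if x is not None}
    if left_values & right_values:
        return True

    # No match found: a null element means the answer is unknown,
    # but only when both arrays are non-empty.
    both_non_empty = len(left) > 0 and len(right) > 0
    has_null = any(x is None for x in left) or any(x is None for x in right)
    if both_non_empty and has_null:
        return None
    return False


print(overlap_model([1, 2, 3], [3, 4]))   # common element 3
print(overlap_model([1, 2], [3, 4]))      # nothing shared, no nulls
print(overlap_model([1, None], [3, 4]))   # nothing shared, but a null is present
```

In a real DataFrame, a null result matters most in filters, because rows whose predicate evaluates to null are dropped just like rows that evaluate to false.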
Note that arrays_overlap accepts exactly two array columns. To compare more than two columns, call it on each pair and combine the results with the boolean column operators | (or) and & (and).
This function is particularly useful as a filter or join condition when you only need to know whether two array columns share any element, for example to keep the rows where two sets of tags or IDs intersect. It does not return the shared elements themselves; for that, use array_intersect.