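PySpark’s array_join function concatenates the elements of an array column into a single string, with the elements separated by a specified delimiter. It takes two required arguments, the array to be concatenated and the delimiter, plus an optional third argument: a string that replaces null elements. If the third argument is omitted, null elements are skipped.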
array_join(array, delimiter [, nullReplacement])
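To make the role of the three arguments concrete, here is a minimal plain-Python sketch of the behaviour described above. It is only a model for reasoning about results, not Spark's implementation, and the name array_join_model is hypothetical.

```python
from typing import List, Optional


def array_join_model(
    values: Optional[List[Optional[str]]],
    delimiter: str,
    null_replacement: Optional[str] = None,
) -> Optional[str]:
    """Plain-Python model of array_join (hypothetical helper, not a Spark API)."""
    # A null array yields a null result.
    if values is None:
        return None
    if null_replacement is None:
        # No replacement given: null elements are dropped.
        kept = [v for v in values if v is not None]
    else:
        # Replacement given: each null element is substituted.
        kept = [null_replacement if v is None else v for v in values]
    return delimiter.join(kept)


print(array_join_model(["apple", None, "orange"], ","))         # apple,orange
print(array_join_model(["apple", None, "orange"], ",", "n/a"))  # apple,n/a,orange
```

The model treats the delimiter and the replacement as ordinary strings, which mirrors how the Spark function is called.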
Here is an example of how to use the array_join function in PySpark:
from pyspark.sql.functions import array_join
# Create a sample dataframe
data = [("John", ["apple", "banana", "orange"]), ("Jane", ["grapes", "pineapple", "kiwi"])]
df = spark.createDataFrame(data, ["name", "fruits"])
# Use the array_join function to concatenate the elements of the "fruits" column into a single string
df = df.withColumn("fruits_list", array_join("fruits", ","))
# Show the result without truncating long values
df.show(20, False)
This will output:
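+----+-------------------------+---------------------+
|name|fruits                   |fruits_list          |
+----+-------------------------+---------------------+
|John|[apple, banana, orange]  |apple,banana,orange  |
|Jane|[grapes, pineapple, kiwi]|grapes,pineapple,kiwi|
+----+-------------------------+---------------------+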
In this example, array_join concatenates the elements of the “fruits” column, which is an array of strings, into a single string. The delimiter is a comma, and the result is stored in a new column named “fruits_list”.
You can also call array_join inside a SQL expression, for example with selectExpr:
df.selectExpr("name", "array_join(fruits, ',') as fruits_list").show(20, False)
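+----+---------------------+
|name|fruits_list          |
+----+---------------------+
|John|apple,banana,orange  |
|Jane|grapes,pineapple,kiwi|
+----+---------------------+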
This produces the same “fruits_list” values as the previous example. The difference is that array_join is written here as a Spark SQL expression string that refers to the column by name, rather than being called through the Python function.
Note that array_join expects a column of array type (typically an array of strings), and the resulting column is always of type string. The delimiter passed to the function must also be a string.