PySpark: How to Prepend an Element to an Array on a Specific Condition

PySpark @ Freshers.in

To prepend an element to an array column only when the array contains a specific word, combine PySpark’s when() and otherwise() functions with array_contains(). The array_contains() function checks whether an array contains a given value, the when() function takes that check as its condition and supplies the value to use for rows that satisfy it, and the otherwise() function supplies the value to use for rows that do not.

The following example prepends an element only when the array contains the word “four”.

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, array_contains, concat, lit, when
# Initialize a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
        ("numbers", ["one", "two", "three", "four", "five"]),
        ("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])
# Show the source data without truncating the long array values
df.show(truncate=False)
# Element to prepend
element = "zero"
# Prepend the element only when the array contains "four"
df = df.withColumn("Items", when(array_contains(df["Items"], "four"), 
                                  concat(array(lit(element)), df["Items"]))
                             .otherwise(df["Items"]))
df.show(20, truncate=False)

Source Data

+--------+-----------------------------------------+
|Category|Items                                    |
+--------+-----------------------------------------+
|fruits  |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five]            |
|colors  |[red, blue, green, yellow, pink]         |
+--------+-----------------------------------------+

Output

+--------+-----------------------------------------+
|Category|Items                                    |
+--------+-----------------------------------------+
|fruits  |[apple, banana, cherry, date, elderberry]|
|numbers |[zero, one, two, three, four, five]      |
|colors  |[red, blue, green, yellow, pink]         |
+--------+-----------------------------------------+

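In this code, when(array_contains(df["Items"], "four"), concat(array(lit(element)), df["Items"])) prepends the element to the array if the array contains “four”: array(lit(element)) builds a one-element array holding the new value, and concat() joins it to the front of the existing array. If the array does not contain “four”, otherwise(df["Items"]) leaves the array as it is.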

This results in a new DataFrame where “zero” is prepended to the array in the “Items” column only if the array contains “four”.
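
To make the per-row behaviour concrete, here is a minimal plain-Python sketch of what the when()/otherwise() expression computes for a single row. It is only a model for illustration, not Spark code: the function name prepend_if_contains and its parameters are hypothetical, and Python's None stands in for a Spark null. It also covers a case the example data does not include, a null array: array_contains() returns null when the array is null, so the condition is not met and the otherwise() branch keeps the null value.

```python
from typing import List, Optional


def prepend_if_contains(
    items: Optional[List[str]], needle: str, element: str
) -> Optional[List[str]]:
    """Plain-Python model of the per-row when/otherwise logic (hypothetical helper)."""
    # A null array makes array_contains() return null, so when() falls
    # through to otherwise() and the null value is kept unchanged.
    if items is None:
        return None
    # Condition met: build a new list with the element in front,
    # mirroring concat(array(lit(element)), Items).
    if needle in items:
        return [element] + items
    # Condition not met: the otherwise() branch returns the array as it is.
    return items


# Same rows as the DataFrame in the example above
rows = {
    "fruits": ["apple", "banana", "cherry", "date", "elderberry"],
    "numbers": ["one", "two", "three", "four", "five"],
    "colors": ["red", "blue", "green", "yellow", "pink"],
}
result = {
    category: prepend_if_contains(items, "four", "zero")
    for category, items in rows.items()
}
print(result["numbers"])  # ['zero', 'one', 'two', 'three', 'four', 'five']
```

Only the “numbers” row changes, which matches the Output table above. In Spark itself the same logic stays inside the column expression, so no Python function is called per row.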

Author: user
