PySpark : Removing all occurrences of a specified element from an array column in a DataFrame

PySpark @ Freshers.in

pyspark.sql.functions.array_remove

Syntax

pyspark.sql.functions.array_remove(col, element)

pyspark.sql.functions.array_remove is a collection function that removes every element equal to element from an array column in a DataFrame and returns the resulting array as a new column. For example, if you have a DataFrame with a column named “colors” that contains arrays of strings, you can use array_remove to remove the string “red” from every array in that column:

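from pyspark.sql import SparkSession
from pyspark.sql.functions import array_remove
# Reuse the active session (for example in the pyspark shell) or start a new one
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["red", "blue", "green"]), (2, ["yellow", "red", "purple"])], ["id", "colors"])
df.show(20, False)  # truncate=False so the full arrays are printed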
+---+---------------------+
|id |colors               |
+---+---------------------+
|1  |[red, blue, green]   |
|2  |[yellow, red, purple]|
+---+---------------------+

Now we need to remove “red” from the column “colors”:

df.select("id", array_remove("colors", "red").alias("new_colors")).show()

Result

+---+----------------+
| id|      new_colors|
+---+----------------+
|  1|   [blue, green]|
|  2|[yellow, purple]|
+---+----------------+
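
To make the "all occurrences" behaviour concrete, here is a small plain-Python sketch of what array_remove does to each row's array. It is only a model for illustration, not Spark's implementation: the helper name array_remove_model and the sample rows are hypothetical, and null handling is deliberately left out.

```python
def array_remove_model(values, element):
    """Plain-Python stand-in (hypothetical helper) for array_remove on one row.

    Keeps the original order and drops every item equal to `element`.
    """
    return [v for v in values if v != element]


# Hypothetical sample rows: duplicates, no match, and an array of only matches
rows = [
    (1, ["red", "blue", "red", "green"]),
    (2, ["yellow", "purple"]),
    (3, ["red", "red"]),
]

result = [(row_id, array_remove_model(colors, "red")) for row_id, colors in rows]

for row in result:
    print(row)
```

Every "red" is dropped, not just the first one; an array without a match comes back unchanged, and an array made up only of matches becomes an empty array. On a real DataFrame, the same transformation can overwrite the original column with df.withColumn("colors", array_remove("colors", "red")) instead of selecting it under a new name.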

Author: user
