PySpark: Transforming a column of arrays or maps into multiple rows with explode() and explode_outer()



In PySpark, the explode() function transforms a column of arrays or maps into multiple rows, producing one row for each element in the array or map. The explode_outer() function is similar, but where explode() drops rows whose array or map is null or empty, explode_outer() keeps them and returns null in the exploded column.

Every element of the specified array or map receives its own row in the result. In contrast to explode(), a row with null is produced if the array or map is empty or null. Unless column names are provided explicitly, Spark uses key and value for map elements and the default column name col for array elements.

Here is an example of using explode_outer() to transform a DataFrame with a column of arrays:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode_outer

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame with a column of arrays
data = [
    (1, ["BMW", "Audi", "Merc"]),
    (2, ["Maruti", "Toyota"]),
    (3, None),
    (4, ["Volkswagen"]),
]
df = spark.createDataFrame(data, ["id", "cars"])

# Use explode_outer to transform the column of arrays
exploded_df = df.select("id", explode_outer("cars"))

# Show the resulting DataFrame
exploded_df.show()
This will output:

+---+----------+
| id|       col|
+---+----------+
|  1|       BMW|
|  1|      Audi|
|  1|      Merc|
|  2|    Maruti|
|  2|    Toyota|
|  3|      null|
|  4|Volkswagen|
+---+----------+

Here the column “cars” is exploded so that each element becomes its own row. The row with a null value in “cars” (id 3) is also retained, with null in the output column.

Spark important urls

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page