PySpark : How to create a map from a column of structs : map_from_entries


pyspark.sql.functions.map_from_entries

map_from_entries(col) is a function in PySpark that creates a map column from a column of arrays of structs, where each struct has two fields: a key and a value. It is a collection function that returns a map built from the given array of entries.

from pyspark.sql import SparkSession
from pyspark.sql.functions import map_from_entries

# Create the SparkSession used below
spark = SparkSession.builder.getOrCreate()

# The ages are given as strings so that all entry values share one type;
# mixing string and integer values in the same array breaks schema inference
df2 = spark.createDataFrame([
    (1, "John", 25000, [("name", "John"), ("age", "25")]),
    (2, "Mike", 30000, [("name", "Mike"), ("age", "30")]),
    (3, "Sophia", 35000, [("name", "Sophia"), ("age", "35")])
], ["id", "name", "salary", "person_map"])
df2 = df2.select("id", "name", "salary", map_from_entries("person_map").alias("map_col"))
df2.show(20, False)

In this example, we first import the necessary functions and create a SparkSession. We then create a DataFrame with a column called “person_map”, which contains an array of (key, value) entries.

We then use the map_from_entries() function to build a new column called “map_col” from the array of entries, using the alias() function to name the new column.

Within each entry, the first field becomes the map key and the second field becomes the map value.

The final DataFrame has four columns: “id”, “name”, “salary” and “map_col”, where “map_col” contains a map created from the entries in “person_map”.

For reference, the schema will be:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- map_col: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Result

+---+------+------+---------------------------+
|id |name  |salary|map_col                    |
+---+------+------+---------------------------+
|1  |John  |25000 |[name -> John, age -> 25]  |
|2  |Mike  |30000 |[name -> Mike, age -> 30]  |
|3  |Sophia|35000 |[name -> Sophia, age -> 35]|
+---+------+------+---------------------------+

In PySpark, creating a map column from entries lets you convert an array of (key, value) structs into a map, where each entry in the array becomes a key-value pair in the resulting map. This can be useful for organizing and structuring data in a more readable and accessible way. Additionally, the map column can then be used in operations such as filtering, aggregation and joins.
