How to transform a JSON Column to multiple columns based on Key in PySpark

user February 5, 2022

JSON Column to multiple columns

Suppose your incoming raw data has a JSON column, and you need to turn each key into a separate column for further analysis. In this post we will learn:

How to read a JSON column using PySpark?
How to create the schema for a JSON column?
How to turn each key of the key-value pairs into a column name in the DataFrame?

Source Code

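from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

# Reuse the active session (notebook or shell) or start one when run as a script.
spark = SparkSession.builder.appName("json-column-to-columns").getOrCreate()

# Sample rows: city_info holds key-value pairs; row 4 has no zip_code key.
data = [
    (1, {"city": "Baltimore", "zip_code": 21201, "county": "Baltimore City"}, "USA"),
    (2, {"city": "East Case", "zip_code": 21202, "county": "Baltimore City"}, "USA"),
    (3, {"city": "Ruxton", "zip_code": 21204, "county": "Baltimore County"}, "USA"),
    (4, {"city": "Orchard Beach", "county": "Anne Arundel County"}, "USA"),
    (5, {"city": "Arbutus", "zip_code": 21227, "county": "Baltimore County"}, "USA"),
]

# Map values are declared as strings, so zip_code is cast back to an integer below.
schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("city_info", MapType(StringType(), StringType(), True)),
    StructField("country", StringType(), True),
])

df = spark.createDataFrame(data, schema)
df.show(20, False)
df.printSchema()

# Look up each key in the map and expose it as its own named column.
df2 = df.select(
    df.si_no,
    df.city_info["city"].alias("city"),
    df.city_info["zip_code"].cast(IntegerType()).alias("zip_code"),
    df.city_info["county"].alias("county"),
    df.country,
)
df2.show(20, False)
df2.printSchema()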


