How to derive the schema of a JSON string in PySpark

user November 24, 2023

The schema_of_json function in PySpark derives the schema of a JSON string and returns it in DDL format. This schema can then be passed to from_json to parse JSON columns in a DataFrame. It is especially useful for semi-structured JSON data whose schema is not known in advance. Because the function inspects a single sample string, that sample should be representative of the records you intend to parse.

Advantages of using schema_of_json

Schema Inference: Automatically infers the schema from a sample JSON string, so you do not have to write it by hand.
Flexibility: Handles nested JSON structures such as arrays and structs.
Efficiency: Infers the schema from a single sample string rather than scanning the whole dataset.

Implementing schema_of_json in PySpark

To demonstrate the use of schema_of_json, we’ll parse a JSON string representing information about different individuals.

Step-by-Step guide for JSON schema inference

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, schema_of_json
from pyspark.sql.types import StringType

# Initialize the Spark session
spark = SparkSession.builder.appName("schema_of_json_Example").getOrCreate()

# Sample JSON data: some records carry an extra "hobbies" array
json_data = [
    '{"name": "Sachin", "age": 30, "city": "Mumbai"}',
    '{"name": "Manju", "age": 25, "city": "Bangalore", "hobbies": ["Reading", "Traveling"]}',
    '{"name": "Ram", "age": 35, "city": "Hyderabad"}',
    '{"name": "Raju", "age": 28, "city": "Chennai", "hobbies": ["Cooking"]}',
    '{"name": "David", "age": 40, "city": "New York"}',
    '{"name": "Wilson", "age": 50, "city": "Washington"}'
]

# Create a single-column DataFrame holding the raw JSON strings
df = spark.createDataFrame(json_data, StringType()).toDF("json_string")

# Infer the schema from the first JSON string only
json_schema = schema_of_json(df.select("json_string").first()[0])

# Parse every JSON string using the inferred schema
df_parsed = df.withColumn("parsed", from_json(col("json_string"), json_schema))

# Expand the parsed struct into top-level columns and show the result
df_parsed.select("parsed.*").show()

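In this example, schema_of_json infers the schema from the first JSON string in the DataFrame, and from_json then parses every JSON string in the DataFrame using that schema. Because the first record has no hobbies field, the inferred schema omits it, and the parsed result contains only age, city and name.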

Output

+---+----------+------+
|age|      city|  name|
+---+----------+------+
| 30|    Mumbai|Sachin|
| 25| Bangalore| Manju|
| 35| Hyderabad|   Ram|
| 28|   Chennai|  Raju|
| 40|  New York| David|
| 50|Washington|Wilson|
+---+----------+------+
