How to derive the schema of a JSON string in PySpark

PySpark @ Freshers.in

The schema_of_json function in PySpark derives the schema of a JSON string. It takes a single sample JSON string and returns the inferred schema in DDL format, which can then be passed to from_json to parse JSON columns in a DataFrame. It is especially useful for semi-structured JSON data whose schema is not known in advance.

Advantages of using schema_of_json

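  1. Schema Inference: Automatically infers field names and types from a sample JSON string.
  2. Flexibility: Handles nested structures such as arrays and objects, as long as they appear in the sample.
  3. Efficiency: Infers the schema from one sample instead of scanning the whole dataset, which is what spark.read.json does when no schema is supplied.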

Implementing schema_of_json in PySpark

To demonstrate the use of schema_of_json, we’ll parse a JSON string representing information about different individuals.

Step-by-Step guide for JSON schema inference

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, schema_of_json
from pyspark.sql.types import StringType
# Initialize Spark Session
spark = SparkSession.builder.appName("schema_of_json_Example").getOrCreate()
# Sample JSON Data
json_data = [
    '{"name": "Sachin", "age": 30, "city": "Mumbai"}',
    '{"name": "Manju", "age": 25, "city": "Bangalore", "hobbies": ["Reading", "Traveling"]}',
    '{"name": "Ram", "age": 35, "city": "Hyderabad"}',
    '{"name": "Raju", "age": 28, "city": "Chennai", "hobbies": ["Cooking"]}',
    '{"name": "David", "age": 40, "city": "New York"}',
    '{"name": "Wilson", "age": 50, "city": "Washington"}'
]
# Creating DataFrame with JSON strings
df = spark.createDataFrame(json_data, StringType()).toDF("json_string")
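# Inferring the schema from the first record only
# (schema_of_json expects a single JSON string, not a column of varying values)
json_schema = schema_of_json(df.select("json_string").first()[0])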
# Parsing JSON with inferred schema
df_parsed = df.withColumn("parsed", from_json(col("json_string"), json_schema))
# Show Results
df_parsed.select("parsed.*").show()

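In this example, schema_of_json infers the schema from the first JSON string in the DataFrame, and from_json then parses every JSON string using that schema. Because the first record has only name, age, and city, the inferred schema contains only those three fields. Fields that appear only in later records, such as hobbies, are not part of the schema and are dropped from the parsed result.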

Output

+---+----------+------+
|age|      city|  name|
+---+----------+------+
| 30|    Mumbai|Sachin|
| 25| Bangalore| Manju|
| 35| Hyderabad|   Ram|
| 28|   Chennai|  Raju|
| 40|  New York| David|
| 50|Washington|Wilson|
+---+----------+------+

Author: user