PySpark: Extract values from JSON strings within a DataFrame [json_tuple]

PySpark @ Freshers.in

pyspark.sql.functions.json_tuple

PySpark provides the json_tuple function for extracting values from JSON strings stored in a DataFrame column. It is particularly useful when you need only a few top-level attributes from each JSON string and do not want to define a full schema. In this article, we explore json_tuple and demonstrate its usage with an example.

Understanding json_tuple

The json_tuple function in PySpark extracts the values of specified attributes from JSON strings within a DataFrame. It takes two or more arguments: the first argument is the input column containing JSON strings, and the subsequent arguments are the attribute names you want to extract from the JSON.

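The json_tuple function is a generator: it produces one output column per requested attribute. Each extracted value is returned as a string, and an attribute that is missing from the JSON yields null. If you do not alias them, the output columns are named c0, c1, and so on.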
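To make the per-row behavior concrete, here is a minimal plain-Python sketch of what we would expect json_tuple to yield for a single JSON string. It is only an approximation for illustration, not how Spark implements the function, and the helper name json_tuple_sketch is hypothetical.

```python
import json


def json_tuple_sketch(json_str, *fields):
    """Approximate the values json_tuple yields for one row (illustration only)."""
    try:
        obj = json.loads(json_str)
    except (TypeError, ValueError):
        # Assumption: unparseable input gives None (null in Spark) for every field
        return tuple(None for _ in fields)
    if not isinstance(obj, dict):
        # Only the attributes of a top-level JSON object can be looked up
        return tuple(None for _ in fields)

    def as_text(value):
        if value is None:
            return None
        if isinstance(value, str):
            return value
        # Numbers, booleans, arrays, and nested objects come back as JSON text
        return json.dumps(value, separators=(",", ":"))

    return tuple(as_text(obj.get(field)) for field in fields)


print(json_tuple_sketch('{"name": "Sachin", "age": 30}', "name", "age", "city"))
# ('Sachin', '30', None)
```

Note that the age comes back as the string '30' rather than the number 30, and that the missing 'city' attribute comes back as None.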

Example Usage

Let’s dive into an example to understand how to use json_tuple in PySpark. Consider the following sample data:

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample data as a DataFrame
data = [
    ('{"name": "Sachin", "age": 30}',),
    ('{"name": "Narendra", "age": 25}',),
    ('{"name": "Jacky", "age": 40}',)
]
df = spark.createDataFrame(data, ['json_data'])
# Show the DataFrame
df.show(truncate=False)

Output:

+-------------------------------+
|json_data                      |
+-------------------------------+
|{"name": "Sachin", "age": 30}  |
|{"name": "Narendra", "age": 25}|
|{"name": "Jacky", "age": 40}   |
+-------------------------------+

In this example, we have a DataFrame named df with a single column called ‘json_data’, which contains JSON strings representing people’s information.

Now, let’s use the json_tuple function to extract the values of the ‘name’ and ‘age’ attributes from the JSON strings:

from pyspark.sql.functions import json_tuple
# Extract 'name' and 'age' attributes using json_tuple
extracted_data = df.select(json_tuple('json_data', 'name', 'age').alias('name', 'age'))
# Show the extracted data
extracted_data.show(truncate=False)

Output:

+--------+---+
|name    |age|
+--------+---+
|Sachin  |30 |
|Narendra|25 |
|Jacky   |40 |
+--------+---+

In the above code, we use the json_tuple function to extract the ‘name’ and ‘age’ attributes from the ‘json_data’ column. We specify the attribute names as arguments to json_tuple (‘name’ and ‘age’), and use the alias method to assign meaningful column names to the extracted attributes.

The resulting extracted_data DataFrame contains two columns: ‘name’ and ‘age’ with the extracted values from the JSON strings.

The json_tuple function is a convenient way to pull specific top-level attributes out of JSON strings in a DataFrame without defining a schema. This makes it a good fit for PySpark pipelines that need only a few fields from each JSON record.

Author: user