PySpark Complex Data Types: ArrayType, MapType, StructField, and StructType

PySpark @ Freshers.in

In this guide, we will explore four essential PySpark data types: ArrayType, MapType, StructField, and StructType. You’ll learn what each one represents, when to use it, and how to apply it effectively in your PySpark projects.

Understanding Complex Data Types

Complex data types in PySpark provide flexibility when dealing with structured and semi-structured data. They enable you to work with nested, hierarchical, or multi-valued data, making PySpark a versatile tool for data engineering and analysis.

1. ArrayType: Handling Lists or Arrays

The ArrayType in PySpark allows you to store a list of values, all sharing the same element type, within a single column. It’s useful for scenarios where you need to work with collections of data, such as tags associated with articles, phone numbers, or email addresses.

Example: Storing Email Addresses

Let’s consider a scenario where you want to store email addresses for a list of individuals:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
# Initialize SparkSession
spark = SparkSession.builder.appName("ArrayType @ Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Sachin", ["sachin@example.com", "sachin@gmail.com"]),
        ("Manju", ["manju@example.com"]),
        ("Ram", ["ram@example.com", "ram@gmail.com"]),
        ("Raju", ["raju@example.com"]),
        ("David", [])]
schema = StructType([StructField("Name", StringType(), True),
                     StructField("Emails", ArrayType(StringType()), True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show(20, False)

Output

+------+--------------------------------------+
|Name  |Emails                                |
+------+--------------------------------------+
|Sachin|[sachin@example.com, sachin@gmail.com]|
|Manju |[manju@example.com]                   |
|Ram   |[ram@example.com, ram@gmail.com]      |
|Raju  |[raju@example.com]                    |
|David |[]                                    |
+------+--------------------------------------+

In this example, we use ArrayType to store multiple email addresses for each individual.

2. MapType: Managing Key-Value Pairs

The MapType data type in PySpark handles key-value pairs within a single column, where all keys share one type and all values another. It’s commonly used for representing attributes, properties, or metadata associated with records.

Example: Storing User Preferences

Imagine a scenario where you want to store user preferences, such as favorite genres and languages:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, MapType

# Initialize SparkSession
spark = SparkSession.builder.appName("MapType @ Freshers.in Learning Example").getOrCreate()

# Create a sample dataframe
data = [("Sachin", {"Genre": "Action", "Language": "English"}),
        ("Manju", {"Genre": "Comedy"}),
        ("Ram", {"Genre": "Drama", "Language": "Spanish"}),
        ("Raju", {}),
        ("David", {"Genre": "Science Fiction"})]

schema = StructType([StructField("Name", StringType(), True),
                     StructField("Preferences", MapType(StringType(), StringType()), True)])

df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show(20, False)

Output

+------+--------------------------------------+
|Name  |Preferences                           |
+------+--------------------------------------+
|Sachin|{Language -> English, Genre -> Action}|
|Manju |{Genre -> Comedy}                     |
|Ram   |{Language -> Spanish, Genre -> Drama} |
|Raju  |{}                                    |
|David |{Genre -> Science Fiction}            |
+------+--------------------------------------+

In this example, we use MapType to store user preferences as key-value pairs.

3. StructField: Defining Column Structure

A StructField defines a single field within a schema: its name, its data type, and whether it may contain nulls. It’s used in conjunction with StructType to build complex, nested structures.

Example: Creating a StructField

Let’s consider a scenario where you want to represent contact information for individuals, including their phone numbers and addresses:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
# Initialize SparkSession
spark = SparkSession.builder.appName("StructField @ Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Sachin", ("123-456-7890", "123 Main St")),
        ("Manju", ("987-654-3210", "456 Elm St")),
        ("Ram", ("555-123-4567", "789 Oak Ave")),
        ("Raju", ("", "")),
        ("David", ("111-222-3333", "321 Pine Rd"))]
schema = StructType([StructField("Name", StringType(), True),
                     StructField("Contact", StructType([
                         StructField("Phone", StringType(), True),
                         StructField("Address", StringType(), True)
                     ]), True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show(20, False)

Output

+------+---------------------------+
|Name  |Contact                    |
+------+---------------------------+
|Sachin|{123-456-7890, 123 Main St}|
|Manju |{987-654-3210, 456 Elm St} |
|Ram   |{555-123-4567, 789 Oak Ave}|
|Raju  |{, }                       |
|David |{111-222-3333, 321 Pine Rd}|
+------+---------------------------+

In this example, we use StructField to define the structure of the “Contact” column, which includes “Phone” and “Address” subfields.

4. StructType: Creating Complex Structures

A StructType is an ordered collection of StructFields. It describes the schema of an entire dataframe as well as the structure of nested struct columns, making it a powerful tool for modeling hierarchical data.

Example: Modeling Employee Data

Suppose you want to model employee data, including personal information and job details:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize SparkSession
spark = SparkSession.builder.appName("StructType @ Freshers.in Learning Example").getOrCreate()

# Create a sample dataframe
data = [("Sachin", ("Male", 30, "2022-01-01")),
        ("Manju", ("Female", 28, "2021-05-15")),
        ("Ram", ("Male", 35, "2020-10-10")),
        ("Raju", ("Male", 40, "2019-07-01")),
        ("David", ("Male", 45, "2018-02-20"))]

schema = StructType([StructField("Name", StringType(), True),
                     StructField("Details", StructType([
                         StructField("Gender", StringType(), True),
                         StructField("Age", IntegerType(), True),
                         StructField("HireDate", StringType(), True)
                     ]), True)])

df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show(20, False)

Output

+------+------------------------+
|Name  |Details                 |
+------+------------------------+
|Sachin|{Male, 30, 2022-01-01}  |
|Manju |{Female, 28, 2021-05-15}|
|Ram   |{Male, 35, 2020-10-10}  |
|Raju  |{Male, 40, 2019-07-01}  |
|David |{Male, 45, 2018-02-20}  |
+------+------------------------+

In this example, we use StructType to create a complex structure for employee data, including personal details.
