Handling Missing or Null Values in PySpark: Strategies and Examples

user February 16, 2024

Missing or null values are a common challenge in data preprocessing and cleaning. PySpark, the Python API for Apache Spark, provides several DataFrame methods for handling them efficiently. This article covers three strategies: dropping rows, filling with a fixed value, and imputing with the mean or median. Each comes with a practical example and its output.

1. Dropping Rows with Null Values:

One approach is to drop the rows that contain null values using the DataFrame's dropna() method. By default, dropna() removes any row that has a null in at least one column.

# Importing PySpark modules
from pyspark.sql import SparkSession
# Creating a SparkSession
spark = SparkSession.builder.appName("HandleNullValues").getOrCreate()
# Creating a DataFrame with null values
data = [(1, "Alice", None), (2, "Bob", 25), (3, None, 30)]
df = spark.createDataFrame(data, ["ID", "Name", "Age"])
# Dropping rows with null values
cleaned_df = df.dropna()
# Displaying the cleaned DataFrame
cleaned_df.show()

Output:

+---+----+---+
| ID|Name|Age|
+---+----+---+
|  2| Bob| 25|
+---+----+---+

2. Filling Null Values with a Specific Value:

Another approach is to fill null values in specific columns with a predefined value using the fillna() method. Passing a dictionary maps each column name to its replacement value.

# Filling null values with a specific value
filled_df = df.fillna({"Name": "Unknown", "Age": 0})
# Displaying the DataFrame with filled values
filled_df.show()

Output:

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice|  0|
|  2|    Bob| 25|
|  3|Unknown| 30|
+---+-------+---+

3. Imputing Null Values with Mean or Median:

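Imputing null values with the mean or median of the respective column is another commonly used technique. PySpark's Imputer, from pyspark.ml.feature, computes the chosen statistic from the non-null values of each input column and writes the result to a new output column.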

# Importing PySpark modules
from pyspark.ml.feature import Imputer
# Casting Age to double so the mean (27.5) is stored as a fractional value
df_numeric = df.withColumn("Age", df["Age"].cast("double"))
# Creating an Imputer object
imputer = Imputer(strategy="mean", inputCols=["Age"], outputCols=["Age_imputed"])
# Fitting the imputer model
imputer_model = imputer.fit(df_numeric)
# Transforming the DataFrame to impute null values
imputed_df = imputer_model.transform(df_numeric)
# Displaying the DataFrame with imputed values
imputed_df.show()

Output:

+---+-----+----+-----------+
| ID| Name| Age|Age_imputed|
+---+-----+----+-----------+
|  1|Alice|null|       27.5|
|  2|  Bob|25.0|       25.0|
|  3| null|30.0|       30.0|
+---+-----+----+-----------+
