Counting null, None, or missing values with precision in PySpark


This article provides a comprehensive guide to counting missing values (null, None, NaN) in PySpark DataFrames, a crucial step in data cleaning and preprocessing. Identifying and counting missing values in a dataset matters for:

  1. Data Quality Assessment: Understanding the extent of missing data to evaluate data quality.
  2. Data Cleaning: Informing the strategy for handling missing data, such as imputation or deletion (a brief sketch of both follows this list).
  3. Analytical Accuracy: Ensuring accurate analysis by acknowledging data incompleteness.
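
As context for item 2, handling missing data usually takes one of two forms: deleting the affected rows or imputing replacement values. The snippet below is a minimal sketch using DataFrame.dropna and DataFrame.fillna; the DataFrame name df (matching the example later in this article) and the fill values "Unknown" and 0 are illustrative assumptions, not recommendations.

# Deletion: drop any row that contains a null/None value
df_dropped = df.dropna(how="any")
# Imputation: fill missing values with column-specific defaults
# (the fill values here are illustrative assumptions)
df_filled = df.fillna({"Gender": "Unknown", "Age": 0})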

Counting missing values in PySpark

PySpark provides functions to efficiently count null, None, and NaN values in DataFrames. Let’s walk through a method to perform this task.

Step-by-step guide

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan
# Initialize Spark Session
spark = SparkSession.builder.appName("CountMissingValues").getOrCreate()
# Sample Data
data = [
    ("Sachin", None, 35),
    ("Manju", "Female", None),
    ("Ram", "Male", 40),
    ("Raju", None, None),
    ("David", "Male", 50),
    ("Wilson", "Male", None)
]
columns = ["Name", "Gender", "Age"]
# Creating DataFrame
df = spark.createDataFrame(data, columns)
# Counting Null, None, NaN Values
null_counts = df.select([count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) for c in df.columns])
# Show Results
null_counts.show()
Output
+----+------+---+
|Name|Gender|Age|
+----+------+---+
|   0|     2|  3|
+----+------+---+
In this example, we use the when, col, isNull, and isnan functions from PySpark to count null, None, and NaN values across every column of the DataFrame: Name has no missing values, Gender has two, and Age has three.
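
Note that isnan only applies to floating-point NaN values; on column types where NaN cannot occur (such as dates or timestamps), applying it can raise a type error. A dtype-aware variant, sketched below under the assumption that NaN checks are only needed for float and double columns, restricts isnan to those columns and falls back to a plain null check everywhere else. With the sample data above it produces the same counts, since none of the columns are floating point.

from pyspark.sql.functions import col, count, when, isnan
# Apply isnan only to float/double columns; all other types get a null-only check
float_cols = {f.name for f in df.schema.fields if f.dataType.typeName() in ("float", "double")}
null_counts = df.select([
    count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) if c in float_cols
    else count(when(col(c).isNull(), c)).alias(c)
    for c in df.columns
])
null_counts.show()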