Counting null, None, or missing values with precision in PySpark


This article provides a comprehensive guide to counting missing values (null, None, NaN) in PySpark DataFrames, a crucial step in data cleaning and preprocessing. Identifying and counting missing values in a dataset matters for:

  1. Data Quality Assessment: Understanding the extent of missing data to evaluate data quality.
  2. Data Cleaning: Informing the strategy for handling missing data, such as imputation or deletion (a brief sketch of both follows this list).
  3. Analytical Accuracy: Ensuring accurate analysis by acknowledging data incompleteness.
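
As context for item 2, handling missing data usually takes one of two forms: deleting the affected rows or imputing replacement values. The snippet below is a minimal sketch using DataFrame.dropna and DataFrame.fillna; the DataFrame name df (matching the example later in this article) and the fill values "Unknown" and 0 are illustrative assumptions, not recommendations.

# Deletion: drop any row that contains a null/None value
df_dropped = df.dropna(how="any")
# Imputation: fill missing values with column-specific defaults
# (the fill values here are illustrative assumptions)
df_filled = df.fillna({"Gender": "Unknown", "Age": 0})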

Counting missing values in PySpark

PySpark provides functions to efficiently count null, None, and NaN values in DataFrames. Let’s walk through a method to perform this task.

Step-by-step guide

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan
# Initialize Spark Session
spark = SparkSession.builder.appName("CountMissingValues").getOrCreate()
# Sample Data
data = [
    ("Sachin", None, 35),
    ("Manju", "Female", None),
    ("Ram", "Male", 40),
    ("Raju", None, None),
    ("David", "Male", 50),
    ("Wilson", "Male", None)
]
columns = ["Name", "Gender", "Age"]
# Creating DataFrame
df = spark.createDataFrame(data, columns)
# Counting Null, None, NaN Values
null_counts = df.select([count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) for c in df.columns])
# Show Results
null_counts.show()
Output
+----+------+---+
|Name|Gender|Age|
+----+------+---+
|   0|     2|  3|
+----+------+---+
In this example, we use the when, col, isNull, and isnan functions from PySpark to count null, None, and NaN values across every column of the DataFrame: Name has no missing values, Gender has two, and Age has three.
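
Note that isnan only applies to floating-point NaN values; on column types where NaN cannot occur (such as dates or timestamps), applying it can raise a type error. A dtype-aware variant, sketched below under the assumption that NaN checks are only needed for float and double columns, restricts isnan to those columns and falls back to a plain null check everywhere else. With the sample data above it produces the same counts, since none of the columns are floating point.

from pyspark.sql.functions import col, count, when, isnan
# Apply isnan only to float/double columns; all other types get a null-only check
float_cols = {f.name for f in df.schema.fields if f.dataType.typeName() in ("float", "double")}
null_counts = df.select([
    count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) if c in float_cols
    else count(when(col(c).isNull(), c)).alias(c)
    for c in df.columns
])
null_counts.show()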