PySpark: Identifying Data Skewness and Partition Row Counts


Data skewness is a common issue in large-scale data processing. It happens when data is not evenly distributed across partitions: some tasks then run much longer than others, which undermines parallel processing systems like Apache Spark.

This article will provide a step-by-step guide on how to identify data skewness by getting the count of rows from each partition using PySpark. 

Step 1: Initializing a SparkSession

First, start (or reuse) a SparkSession, the entry point for DataFrame operations:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("DataSkewness").getOrCreate()

Step 2: Creating Sample Data

Let’s create a DataFrame df to represent our website traffic data:

from pyspark.sql import Row
# Assume the data represents the visitor count of each page on the website
data = [("Home", 40000), ("About", 5000), ("Contact", 3000), ("Services", 20000), ("Blog", 10000)]
df = spark.createDataFrame(data, ["Page", "VisitorCount"])

Step 3: Repartitioning Data

For the purpose of this demonstration, we will repartition our DataFrame into 3 partitions using the repartition() function:

df_repartitioned = df.repartition(3)

Step 4: Counting Rows per Partition

Now, let’s define a function to calculate the number of rows in each partition:

def partition_count(df):
    # Map over partitions with their index; each partition contributes
    # a single (partition_index, row_count) tuple.
    return df.rdd.mapPartitionsWithIndex(
        lambda index, iterator: [(index, len(list(iterator)))]
    ).collect()

In this function, we use mapPartitionsWithIndex() to create a new RDD where each element is a tuple containing the index of the partition and the number of rows in that partition. We then collect this data back to the driver program with collect().

Calling partition_count(df_repartitioned) might produce output like this:

[(0, 2), (1, 1), (2, 2)]

Each tuple pairs a partition index with that partition’s row count. This is where skewness shows up: if data were perfectly distributed, every partition would hold an equal (or nearly equal) number of rows. In our toy example the counts (2, 1, 2) are about as balanced as five rows across three partitions can be, but on a real dataset, one partition holding far more rows than the rest is the signature of skew.
