In PySpark, managing data across different partitions is crucial for optimizing performance, especially for large-scale data processing tasks. Two methods used to modify the partitioning of a DataFrame are repartition() and coalesce(). Here, we’ll break down the differences between these methods and provide examples for clarity.
repartition()
The repartition() method can increase or decrease the number of partitions in a DataFrame. It always performs a full shuffle: all rows are redistributed across the cluster to build the new partitions, which can be resource-intensive but produces partitions of roughly equal size.
coalesce()
On the other hand, coalesce() is used only to reduce the number of partitions in a DataFrame; if you request more partitions than currently exist, the partition count stays the same. Unlike repartition(), coalesce() is a narrow transformation: it merges existing partitions instead of performing a full shuffle, which usually makes it cheaper, although the merged partitions may end up uneven in size.
Let’s consider a simple DataFrame fruit_sales that represents sales data of different fruits:
+-------+-----+
|  Fruit|Sales|
+-------+-----+
|  Berry|  100|
|coconut|  150|
| Orange|  200|
|  Grape|   50|
| Cherry|  120|
+-------+-----+
Create a sample DataFrame with the above data:
from pyspark.sql import Row, SparkSession

# Reuse an existing session if one is running, otherwise start a new one
spark = SparkSession.builder \
    .appName("Repartition and coalesce @ Freshers.in Learning") \
    .getOrCreate()

# One Row per fruit; column names come from the keyword arguments
data = [
    Row(Fruit='Berry', Sales=100),
    Row(Fruit='coconut', Sales=150),
    Row(Fruit='Orange', Sales=200),
    Row(Fruit='Grape', Sales=50),
    Row(Fruit='Cherry', Sales=120)
]
fruit_sales = spark.createDataFrame(data)
repartition()
Assume fruit_sales initially has 2 partitions. If we want to redistribute the data into 3 partitions, we use repartition().
repartitioned_df = fruit_sales.repartition(3)
print(repartitioned_df.rdd.getNumPartitions())  # 3
This will cause a full shuffle of the data and redistribute it across 3 partitions. While this method can be used to increase parallelism, it can be expensive in terms of resources due to the full shuffle of the data.
coalesce()
If we want to reduce the number of partitions from 3 to 1, we can use coalesce().
coalesced_df = repartitioned_df.coalesce(1)
print(coalesced_df.rdd.getNumPartitions())  # 1
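coalesce() merges the existing 3 partitions into 1 partition without performing a full shuffle of the data, making it more efficient than repartition() when reducing the number of partitions. Keep in mind that a drastic reduction such as coalesce(1) can also reduce the parallelism of the computation, so it suits small results better than large datasets.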
Visualization
If fruit_sales initially has two partitions represented as:
Partition 1: Berry, coconut
Partition 2: Orange, Grape, Cherry
After applying repartition(3), the DataFrame might be reshuffled to:
Partition 1: Berry, Cherry
Partition 2: Orange, Grape
Partition 3: coconut
If we then apply coalesce(1), the partitions will be merged without shuffling the data:
Partition 1: Berry, Cherry, Orange, Grape, coconut
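To make the visualization concrete, here is a small pure-Python sketch that mimics the two behaviours, with plain lists standing in for partitions. It is a toy model and not Spark code: full_shuffle and merge_partitions are hypothetical helper names, and the round-robin order is an assumption made for illustration. Spark's real placement of rows will generally differ.

```python
from typing import List

Partition = List[str]


def full_shuffle(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of repartition(n): every row may land in a different partition."""
    rows = [row for part in partitions for row in part]
    new_parts: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)  # assumed round-robin assignment
    return new_parts


def merge_partitions(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of coalesce(n): whole partitions are joined, never split."""
    n = min(n, len(partitions))  # merging cannot create extra partitions
    new_parts: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new_parts[i % n].extend(part)  # each old partition moves as one unit
    return new_parts


start = [["Berry", "coconut"], ["Orange", "Grape", "Cherry"]]
shuffled = full_shuffle(start, 3)
merged = merge_partitions(shuffled, 1)

print(shuffled)
print(merged)
```

The rows in each toy partition do not match the illustration above exactly, which is expected. The point is only that the shuffle step can split up existing partitions, while the merge step keeps each one intact.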
When to use repartition() and coalesce()
Use repartition() when:
You need to increase the number of partitions.
You need to rebalance the data with a full shuffle, typically because some partitions are much larger than others (skewed data).
Use coalesce() when:
You need to decrease the number of partitions.
You want to avoid a full shuffle to save resources.
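The guidelines above can be condensed into a small helper. This is only a sketch of the rule of thumb in this section: choose_partition_method is a hypothetical function, not part of PySpark, and real decisions also depend on factors such as data volume and cluster size.

```python
def choose_partition_method(current: int, target: int, skewed: bool = False) -> str:
    """Return which method the rules of thumb in this section would suggest."""
    if target < 1:
        raise ValueError("target must be at least 1")
    if skewed or target > current:
        # Rebalancing or growing the partition count needs a full shuffle
        return "repartition"
    if target < current:
        # Shrinking without rebalancing can simply merge existing partitions
        return "coalesce"
    return "no change needed"


print(choose_partition_method(current=2, target=3))
print(choose_partition_method(current=3, target=1))
print(choose_partition_method(current=3, target=1, skewed=True))
```

The returned string would then decide whether to call fruit_sales.repartition(target) or fruit_sales.coalesce(target).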