PySpark - What is a map-side join and how to perform a map-side join in PySpark


A map-side join is a method of joining two datasets in PySpark in which the smaller dataset is broadcast to every executor, and the join is then performed locally on each executor instead of shuffling and sorting both datasets across the cluster. This eliminates the shuffle of the larger dataset and can significantly improve performance.

To perform a map-side join in PySpark, wrap the smaller DataFrame in the broadcast() function and pass the result to join().

Here’s an example of how to perform a map-side join in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Create a SparkSession
spark = SparkSession.builder.appName("Map-side join example").getOrCreate()

# Create two DataFrames
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
df2 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value"])

# Perform the join, broadcasting the smaller DataFrame
result = df1.join(broadcast(df2), on="id")

# Show the result
result.show()
In the example above, df2 is broadcast to every executor, and each partition of df1 is joined locally against the broadcast copy, so df1 never needs to be shuffled.


+---+-----+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|    a|    A|
|  2|    b|    B|
|  3|    c|    C|
+---+-----+-----+


It’s worth noting that a map-side join is efficient only when one of the datasets is small enough to fit in the memory of each executor. Broadcasting a large dataset defeats the purpose of the technique and can cause out-of-memory errors on both the driver and the executors, so use this method with caution.

Also make sure the broadcast side stays small: Spark must collect it to the driver and ship a full copy to every executor, so only datasets well below the available executor memory are good candidates for a map-side join.
