PySpark: Unpivot a DataFrame from wide format to long format with ‘melt’


Apache Spark has emerged as a dominant force in big data processing, offering strong scalability and performance. While Spark provides robust functionality for distributed data processing, leveraging familiar tools like Pandas can streamline data manipulation. In this article, we explore the Pandas API on Spark, focusing on the ‘melt’ function, which reshapes DataFrames from wide format to long format.

Introducing Pandas API on Spark

Pandas is a popular Python library for data manipulation and analysis, renowned for its ease of use and powerful functionality. Leveraging Pandas API on Spark allows users to apply Pandas operations directly to Spark DataFrames, providing a seamless transition for those familiar with Pandas syntax.

One notable function available in both Pandas and Spark is ‘melt’, which is used to unpivot DataFrames from wide format to long format. Let’s delve into how ‘melt’ works and explore practical examples to understand its functionality.

Understanding the ‘melt’ Function

The ‘melt’ function in Pandas (and now available in Spark through the Pandas API) reshapes a DataFrame from wide format to long format, optionally keeping one or more identifier columns fixed while the remaining columns are unpivoted into variable/value pairs. This operation is commonly used to transform datasets for easier analysis and visualization.

Syntax:

melt(frame, id_vars=None, value_vars=None, var_name='variable', value_name='value')

frame: The DataFrame to be melted.

id_vars: Columns to use as identifier variables (optional).

value_vars: Columns to unpivot (optional).

var_name: Name to use for the variable column (default is ‘variable’).

value_name: Name to use for the value column (default is ‘value’).
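To make these parameters concrete, here is a minimal, self-contained sketch in plain pandas (the pyspark.pandas version behaves the same way); the column names id, x, y, metric, and reading are invented for this illustration:

```python
import pandas as pd

# Small wide-format frame invented for this illustration
df = pd.DataFrame({'id': [1, 2], 'x': [10, 20], 'y': [30, 40]})

# Keep 'id' as the identifier column, unpivot only 'x' via value_vars,
# and rename the generated columns with var_name / value_name
long_df = pd.melt(df, id_vars=['id'], value_vars=['x'],
                  var_name='metric', value_name='reading')
print(long_df)
#    id metric  reading
# 0   1      x       10
# 1   2      x       20
```

Note that column 'y' is dropped entirely because it appears in neither id_vars nor value_vars; omit value_vars to unpivot every non-identifier column.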

Example: Unpivoting a DataFrame with ‘melt’

Consider the following DataFrame representing sales data:

from pyspark.sql import SparkSession
import pandas as pd
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pandas API on Spark Example") \
    .getOrCreate()
# Sample data
data = [('Jan', 100, 120),
        ('Feb', 150, 160),
        ('Mar', 200, 180)]
# Create a Spark DataFrame
df = spark.createDataFrame(data, ['Month', 'Product_A', 'Product_B'])
# Convert the Spark DataFrame to a pandas DataFrame
# (toPandas() collects all rows to the driver, so this suits small demo data)
pandas_df = df.toPandas()
print("Original DataFrame:")
print(pandas_df)
# Unpivot the DataFrame using Pandas 'melt' function
melted_df = pd.melt(pandas_df, id_vars=['Month'], var_name='Product', value_name='Sales')
print("\nDataFrame after melting:")
print(melted_df)

Output:

Original DataFrame:
   Month  Product_A  Product_B
0    Jan        100        120
1    Feb        150        160
2    Mar        200        180

DataFrame after melting:
  Month    Product  Sales
0   Jan  Product_A    100
1   Feb  Product_A    150
2   Mar  Product_A    200
3   Jan  Product_B    120
4   Feb  Product_B    160
5   Mar  Product_B    180
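The transformation is also reversible: pandas’ ‘pivot’ inverts ‘melt’, turning the long-format result back into the original wide layout. A sketch using the melted sales data from the example above:

```python
import pandas as pd

# Recreate the melted sales data shown in the output above
melted_df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'Product': ['Product_A'] * 3 + ['Product_B'] * 3,
    'Sales': [100, 150, 200, 120, 160, 180],
})

# pivot inverts melt: the identifier becomes the index, the variable
# column supplies the new column names, and the value column fills the cells
wide_df = (melted_df
           .pivot(index='Month', columns='Product', values='Sales')
           .reset_index())
print(wide_df)
```

Note that ‘pivot’ sorts the index, so the months come back in alphabetical order rather than the original Jan/Feb/Mar order; the values themselves are unchanged.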