PySpark : Identify the grouping level in data after performing a group by operation with cube or rollup in PySpark [grouping_id]

user May 23, 2023 Leave a Comment

pyspark.sql.functions.grouping_id(*cols)

This function is valuable when you need to identify the grouping level in data after performing a group by operation with cube or rollup. In this article, we will delve into the details of the grouping_id function and its usage with an example.

The grouping_id function signature in PySpark is as follows:

pyspark.sql.functions.grouping_id(*cols)

This function doesn’t require any argument, but it’s often used with columns in a DataFrame.

The grouping_id function is used in conjunction with the cube or rollup operations, and it provides an ID to indicate the level of grouping. The more columns the data is grouped by, the smaller the grouping ID will be.

Example Usage

Let’s go through a simple example to understand the usage of the grouping_id function.

Suppose we have a DataFrame named df containing three columns: ‘City’, ‘Product’, and ‘Sales’.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [("New York", "Apple", 100), 
        ("Los Angeles", "Orange", 200),
        ("New York", "Banana", 150), 
        ("Los Angeles", "Apple", 120),
        ("New York", "Orange", 75),
        ("Los Angeles", "Banana", 220)]
df = spark.createDataFrame(data, ["City", "Product", "Sales"])
df.show()

Result : DataFrame

+-----------+-------+-----+
|       City|Product|Sales|
+-----------+-------+-----+
|   New York|  Apple|  100|
|Los Angeles| Orange|  200|
|   New York| Banana|  150|
|Los Angeles|  Apple|  120|
|   New York| Orange|   75|
|Los Angeles| Banana|  220|
+-----------+-------+-----+

Now, let’s perform a cube operation on the ‘City’ and ‘Product’ columns and compute the total ‘Sales’ for each group. Also, let’s add a grouping_id column to identify the level of grouping.

from pyspark.sql.functions import sum, grouping_id
df_grouped = df.cube("City", "Product").agg(sum("Sales").alias("TotalSales"), grouping_id().alias("GroupingID"))
df_grouped.orderBy("GroupingID").show()

The orderBy function is used here to sort the result by the ‘GroupingID’ column. The output will look something like this:

+-----------+-------+----------+----------+
|       City|Product|TotalSales|GroupingID|
+-----------+-------+----------+----------+
|   New York| Banana|       150|         0|
|Los Angeles| Orange|       200|         0|
|Los Angeles|  Apple|       120|         0|
|   New York|  Apple|       100|         0|
|   New York| Orange|        75|         0|
|Los Angeles| Banana|       220|         0|
|   New York|   null|       325|         1|
|Los Angeles|   null|       540|         1|
|       null|  Apple|       220|         2|
|       null| Banana|       370|         2|
|       null| Orange|       275|         2|
|       null|   null|       865|         3|
+-----------+-------+----------+----------+

As you can see, the grouping_id function provides a numerical identifier that describes the level of grouping in the DataFrame, with smaller values corresponding to more columns being used for grouping.

The grouping_id function is a powerful tool for understanding the level of grouping in your data when using cube or rollup operations in PySpark. It provides valuable insights, especially when dealing with complex datasets with multiple levels of aggregation.

Spark important urls to refer

Post Views: 24

Author: user

PySpark : Identify the grouping level in data after performing a group by operation with cube or rollup in PySpark [grouping_id]

pyspark.sql.functions.grouping_id(*cols)

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Most Viewed Posts

pyspark.sql.functions.grouping_id(*cols)

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget