PySpark : Identify the grouping level in data after performing a group by operation with cube or rollup in PySpark [grouping_id]

This function is valuable when you need to identify the grouping level in data after performing a group by operation with cube or rollup. In this article, we will delve into the details of the grouping_id function and its usage with an example.

The grouping_id function signature in PySpark is as follows:

pyspark.sql.functions.grouping_id(*cols)

The function can be called with no arguments, in which case it applies to all of the grouping columns; if you do pass columns, they must exactly match the columns used in the cube or rollup.

The grouping_id function is used in conjunction with the cube or rollup operations, and it returns an ID that indicates the level of grouping. The ID is a bitmask with one bit per grouping column: a bit is 1 when that column has been aggregated away (and therefore shows up as null in the result). Rows grouped by all of the columns get an ID of 0, and the grand-total row gets the largest ID.
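
To make the encoding concrete, here is a minimal sketch (the column names c1, c2, and v are purely illustrative) that shows grouping_id alongside the per-column grouping function; the ID is simply the per-column bits packed together, with the first grouping column in the most significant position:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two grouping columns, so the bitmask has two bits.
toy = spark.createDataFrame([("a", "x", 1), ("a", "y", 2)], ["c1", "c2", "v"])

(toy.cube("c1", "c2")
    .agg(F.sum("v").alias("total"),
         F.grouping("c1").alias("g1"),   # 1 when c1 is aggregated away (null in the output)
         F.grouping("c2").alias("g2"),   # 1 when c2 is aggregated away
         F.grouping_id().alias("gid"))   # gid = g1 * 2 + g2
    .orderBy("gid")
    .show())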

Example Usage

Let’s go through a simple example to understand the usage of the grouping_id function.

Suppose we have a DataFrame named df containing three columns: ‘City’, ‘Product’, and ‘Sales’.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [("New York", "Apple", 100), 
        ("Los Angeles", "Orange", 200),
        ("New York", "Banana", 150), 
        ("Los Angeles", "Apple", 120),
        ("New York", "Orange", 75),
        ("Los Angeles", "Banana", 220)]
df = spark.createDataFrame(data, ["City", "Product", "Sales"])
df.show()

Result: DataFrame

+-----------+-------+-----+
|       City|Product|Sales|
+-----------+-------+-----+
|   New York|  Apple|  100|
|Los Angeles| Orange|  200|
|   New York| Banana|  150|
|Los Angeles|  Apple|  120|
|   New York| Orange|   75|
|Los Angeles| Banana|  220|
+-----------+-------+-----+

Now, let’s perform a cube operation on the ‘City’ and ‘Product’ columns and compute the total ‘Sales’ for each group. Also, let’s add a grouping_id column to identify the level of grouping.

from pyspark.sql.functions import sum, grouping_id
df_grouped = df.cube("City", "Product").agg(
    sum("Sales").alias("TotalSales"),
    grouping_id().alias("GroupingID"))
df_grouped.orderBy("GroupingID").show()

The orderBy function is used here to sort the result by the ‘GroupingID’ column. The output will look something like this:

+-----------+-------+----------+----------+
|       City|Product|TotalSales|GroupingID|
+-----------+-------+----------+----------+
|   New York| Banana|       150|         0|
|Los Angeles| Orange|       200|         0|
|Los Angeles|  Apple|       120|         0|
|   New York|  Apple|       100|         0|
|   New York| Orange|        75|         0|
|Los Angeles| Banana|       220|         0|
|   New York|   null|       325|         1|
|Los Angeles|   null|       540|         1|
|       null|  Apple|       220|         2|
|       null| Banana|       370|         2|
|       null| Orange|       275|         2|
|       null|   null|       865|         3|
+-----------+-------+----------+----------+

As you can see, the grouping_id function provides a numerical identifier that describes the level of grouping: 0 for the fully detailed City and Product rows, 1 for the per-city subtotals (Product aggregated away), 2 for the per-product subtotals (City aggregated away), and 3 for the grand total. Smaller values correspond to more columns being used for grouping.
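
One practical use of this ID is to slice the result by aggregation level. As a quick sketch continuing with the df_grouped DataFrame from above, we can pull out just the per-city subtotals or just the grand total:

# Per-city subtotals: Product was aggregated away, so GroupingID == 1
df_grouped.filter("GroupingID = 1").select("City", "TotalSales").show()

# Grand total: both columns were aggregated away, so GroupingID == 3
df_grouped.filter("GroupingID = 3").select("TotalSales").show()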

The grouping_id function is a powerful tool for understanding the level of grouping in your data when using cube or rollup operations in PySpark. It provides valuable insights, especially when dealing with complex datasets with multiple levels of aggregation.
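
For comparison, rollup on the same columns produces only the hierarchical grouping sets (City and Product together, City alone, and the grand total), so the GroupingID = 2 rows from the cube output above would not appear. A quick sketch reusing the same df:

df_rolled = df.rollup("City", "Product").agg(
    sum("Sales").alias("TotalSales"),
    grouping_id().alias("GroupingID"))
df_rolled.orderBy("GroupingID").show()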
