Grouping and aggregating multi-column data with PySpark – Complete example included


The groupBy function is widely used in PySpark SQL to group a DataFrame by one or more columns so that aggregation functions can be applied to each group. In this article, we will walk through a detailed example of a GroupBy operation on multiple columns: we will initialize a SparkSession, create a DataFrame named freshers_views representing freshers' opinions on various technologies, group it by the Technology and Rating columns, and aggregate the Views.

Let’s assume we have a DataFrame named freshers_views that contains the opinions of freshers (new employees) on various technologies.

Here is the sample data:

+---------+------------+-------+--------+
|   Name  | Technology | Views | Rating |
+---------+------------+-------+--------+
| Amar    | Python     |   120 | 4.5    |
| Bobby   | Java       |   100 | 4.0    |
| Chinju  | Python     |   150 | 4.7    |
| Dravid  | Scala      |    80 | 3.8    |
| Sachin  | Java       |   120 | 4.2    |
+---------+------------+-------+--------+

Step 1: Initialize SparkSession

Before we start, we need to initialize the SparkSession, which is the entry point to any Spark functionality in PySpark.

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Multiple Columns GroupBy Example") \
    .getOrCreate()

Step 2: Create the DataFrame

Now, let’s create the freshers_views DataFrame using the sample data provided above.

from pyspark.sql import Row
data = [
    Row(Name='Amar', Technology='Python', Views=120, Rating=4.5),
    Row(Name='Bobby', Technology='Java', Views=100, Rating=4.0),
    Row(Name='Chinju', Technology='Python', Views=150, Rating=4.7),
    Row(Name='Dravid', Technology='Scala', Views=80, Rating=3.8),
    Row(Name='Sachin', Technology='Java', Views=120, Rating=4.2)
]

freshers_views = spark.createDataFrame(data)

Step 3: GroupBy on multiple columns

Now, let’s perform a GroupBy operation on the Technology and Rating columns, and aggregate the Views column using the sum function.

from pyspark.sql import functions as F

result_df = freshers_views.groupBy("Technology", "Rating") \
    .agg(F.sum("Views").alias("Total_Views")) \
    .sort("Technology", "Rating")

Output

Finally, let’s display the results using the show() method.

result_df.show()
+----------+------+-----------+
|Technology|Rating|Total_Views|
+----------+------+-----------+
|      Java|   4.0|        100|
|      Java|   4.2|        120|
|    Python|   4.5|        120|
|    Python|   4.7|        150|
|     Scala|   3.8|         80|
+----------+------+-----------+

Important Spark URLs for reference

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page