Grouping and aggregating multi-column data with PySpark – Complete example included


The groupBy function is widely used in PySpark SQL to group a DataFrame by one or more columns so that aggregation functions can be applied to each group. In this article, we will walk through a detailed example of a GroupBy operation on multiple columns: we will initialize a SparkSession, create a DataFrame named freshers_views representing freshers' opinions on various technologies, group it by the Technology and Rating columns, and aggregate the Views.

Let’s assume we have a DataFrame named freshers_views that contains the opinions of freshers (new employees) on various technologies.

Here is the sample data:

+---------+------------+-------+--------+
|   Name  | Technology | Views | Rating |
+---------+------------+-------+--------+
| Amar    | Python     |   120 | 4.5    |
| Bobby   | Java       |   100 | 4.0    |
| Chinju  | Python     |   150 | 4.7    |
| Dravid  | Scala      |    80 | 3.8    |
| Sachin  | Java       |   120 | 4.2    |
+---------+------------+-------+--------+

Step 1: Initialize SparkSession

Before we start, we need to initialize the SparkSession, which is the entry point to any Spark functionality in PySpark.

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Multiple Columns GroupBy Example") \
    .getOrCreate()

Step 2: Create the DataFrame

Now, let’s create the freshers_views DataFrame using the sample data provided above.

from pyspark.sql import Row
data = [
    Row(Name='Amar', Technology='Python', Views=120, Rating=4.5),
    Row(Name='Bobby', Technology='Java', Views=100, Rating=4.0),
    Row(Name='Chinju', Technology='Python', Views=150, Rating=4.7),
    Row(Name='Dravid', Technology='Scala', Views=80, Rating=3.8),
    Row(Name='Sachin', Technology='Java', Views=120, Rating=4.2)
]

freshers_views = spark.createDataFrame(data)

Step 3: GroupBy on multiple columns

Now, let’s perform a GroupBy operation on the Technology and Rating columns, and aggregate the Views column using the sum function.

from pyspark.sql import functions as F

result_df = freshers_views.groupBy("Technology", "Rating") \
    .agg(F.sum("Views").alias("Total_Views")) \
    .sort("Technology", "Rating")

Output

Finally, let’s display the results using the show() method.

result_df.show()
+----------+------+-----------+
|Technology|Rating|Total_Views|
+----------+------+-----------+
|      Java|   4.0|        100|
|      Java|   4.2|        120|
|    Python|   4.5|        120|
|    Python|   4.7|        150|
|     Scala|   3.8|         80|
+----------+------+-----------+

Important Spark URLs for reference

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page