PySpark – groupby with aggregation (count, sum, mean, min, max)

user August 25, 2021

pyspark.sql.DataFrame.groupBy

The PySpark groupBy function (groupby is an alias) groups a DataFrame by the specified columns so that aggregations (count, sum, mean, min, max) can be run on each group. It works with a single column or with multiple columns. With group aggregate pandas UDFs there is no partial aggregation, i.e., a full shuffle is required, and all the data of a group is loaded into memory, so be aware of the out-of-memory risk if the data is skewed and certain groups are too large to fit in memory. Built-in aggregate functions such as sum and count do not have this limitation, since Spark can partially aggregate them within each partition before the shuffle.
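The difference between partial aggregation and gathering whole groups can be sketched in plain Python, without Spark. This is only an illustration of the idea, not Spark's implementation: the partition layout and the helper names (partial_sums, merge_partials, full_groups) are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: (dept, salary) rows already split across two partitions.
partitions = [
    [("IT", 125000), ("HR", 90000), ("IT", 85000)],
    [("HR", 60000), ("IT", 95000), ("SALES", 70000)],
]

def partial_sums(rows):
    """Reduce one partition to a single (sum, count) pair per key."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {k: tuple(v) for k, v in acc.items()}

def merge_partials(partials):
    """Merge the per-partition pairs; only these small pairs need to move."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {k: tuple(v) for k, v in total.items()}

def full_groups(parts):
    """No partial aggregation: every row of a group is gathered in one place."""
    groups = defaultdict(list)
    for rows in parts:
        for key, value in rows:
            groups[key].append(value)
    return dict(groups)

# Partial aggregation keeps one small pair per key, however large the group is.
combined = merge_partials(partial_sums(p) for p in partitions)
print(combined)

# Without it, memory use grows with the size of the largest group.
gathered = full_groups(partitions)
print({k: len(v) for k, v in gathered.items()})
```

In the first path each key is represented by one (sum, count) pair regardless of how many rows it has. In the second path a skewed key holds all of its rows in a single list, which is where the memory risk comes from.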

Sample code with hardcoded data

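from pyspark.sql import SparkSession
# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("PySpark groupby @ Freshers.in ").getOrCreate()
# Column names only; Spark infers the column types from the data
emp_schema = ["name","id","age","gender","dept","state","salary","increment"]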
emp_data = [("Sam",1010,30,"Male","IT","CT",125000,8.1),
("Peter",1020,32,"Male","HR","NY",90000,7.0),
("Tom",1030,34,"Male","IT","CT",85000,5.0),
("Joe",1040,55,"Male","HR","GA",60000,4.0),
("Barry",1050,34,"Male","IT","OR",95000,6.0),
("Mike",1060,42,"Male","SALES","OR",70000,4.0),
("Suzy",1070,34,"Female","IT","GA",99000,6.0),
("Annie",1080,54,"Female","SALES","GA",80000,3.0),
("Bush",1090,34,"Male","IT","AL",85000,5.0),
("Evan",1100,44,"Female","CRM","GA",75000,8.0)]
df = spark.createDataFrame(data=emp_data,schema=emp_schema)
# PySpark groupby with sum
grp_by_dept_sum = df.groupby('dept').agg({'salary': 'sum'})
grp_by_dept_sum.show()
+-----+-----------+
| dept|sum(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|     150000|
|   IT|     489000|
|   HR|     150000|
+-----+-----------+
# PySpark groupby with mean
# (show() returns None, so keep the DataFrame and call show() separately)
grp_by_dept_mean = df.groupby('dept').agg({'salary': 'mean'})
grp_by_dept_mean.show()
+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
|  CRM|    75000.0|
|SALES|    75000.0|
|   IT|    97800.0|
|   HR|    75000.0|
+-----+-----------+
# PySpark groupby with min
grp_by_dept_min = df.groupby('dept').agg({'salary': 'min'})
grp_by_dept_min.show()
+-----+-----------+
| dept|min(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|      70000|
|   IT|      85000|
|   HR|      60000|
+-----+-----------+
# PySpark groupby with max
grp_by_dept_max = df.groupby('dept').agg({'salary': 'max'})
grp_by_dept_max.show()
+-----+-----------+
| dept|max(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|      80000|
|   IT|     125000|
|   HR|      90000|
+-----+-----------+
# PySpark groupby with count
grp_by_dept_count = df.groupby('dept').agg({'*': 'count'})
grp_by_dept_count.show()
+-----+--------+
| dept|count(1)|
+-----+--------+
|  CRM|       1|
|SALES|       2|
|   IT|       5|
|   HR|       2|
+-----+--------+
# PySpark groupby with multiple columns
grp_by_dept_gender_mean = df.groupby('dept', 'gender').agg({'salary': 'mean'})
grp_by_dept_gender_mean.show()
+-----+------+-----------+
| dept|gender|avg(salary)|
+-----+------+-----------+
|  CRM|Female|    75000.0|
|SALES|  Male|    70000.0|
|   IT|  Male|    97500.0|
|SALES|Female|    80000.0|
|   IT|Female|    99000.0|
|   HR|  Male|    75000.0|
+-----+------+-----------+
