Window Functions in PySpark

user December 20, 2023

In this comprehensive guide, we’ll delve into what Window Functions are, how they work in PySpark, and provide real-world examples for practical understanding.

Understanding Window Functions

Window Functions, also known as windowing or analytic functions, perform calculations across a set of table rows related to the current row. Unlike traditional aggregate functions, Window Functions don’t collapse the result set into a single value; instead, they provide insights into specific windows of data.

Key Components of Window Functions

Before diving into examples, let’s understand the key components of Window Functions:

PARTITION BY: This clause divides the result set into partitions to perform calculations on distinct subsets of data.
ORDER BY: It determines the order in which rows within each partition are processed. This is essential for defining the window frame.
Window Frame: The window frame specifies the range of rows over which a calculation is performed. You can use the frame to define preceding and following rows relative to the current row.

Real-world Example: Calculating Employee Salaries

Imagine you have a dataset of employee salaries and you want to calculate each employee’s salary rank within their department based on their experience (years of service). Here’s how you can achieve this using Window Functions in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
# Initialize Spark
spark = SparkSession.builder.appName("Learning @ freshers.in  EmployeeSalaries").getOrCreate()
# Sample data
data = [("Sachin", "HR", 50000, 5),
        ("Manju", "Finance", 60000, 4),
        ("Ram", "HR", 55000, 6),
        ("Raju", "Finance", 70000, 7),
        ("David", "HR", 48000, 3),
        ("Freshers_In", "Finance", 35000, 1),
        ("Wilson", "HR", 75000, 8)]
columns = ["Name", "Department", "Salary", "YearsOfService"]
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Define a Window specification
window_spec = Window.partitionBy("Department").orderBy(F.desc("YearsOfService"))
# Calculate salary rank within each department based on experience
df = df.withColumn("SalaryRank", F.rank().over(window_spec))
df.show()

We use the rank() Window Function to calculate the salary rank for employees within each department based on their years of service.

Output

+-----------+----------+------+--------------+----------+
|       Name|Department|Salary|YearsOfService|SalaryRank|
+-----------+----------+------+--------------+----------+
|       Raju|   Finance| 70000|             7|         1|
|      Manju|   Finance| 60000|             4|         2|
|Freshers_In|   Finance| 35000|             1|         3|
|     Wilson|        HR| 75000|             8|         1|
|        Ram|        HR| 55000|             6|         2|
|     Sachin|        HR| 50000|             5|         3|
|      David|        HR| 48000|             3|         4|
+-----------+----------+------+--------------+----------+

Spark important urls to refer

Post Views: 14

Author: user

Window Functions in PySpark

Trending

Recent Posts

Featured Posts – Slider Widget

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Troubleshooting Data Ingestion and Processing Issues with AWS Kinesis Streams

Impact of Shard Count Modification on AWS Kinesis Streams

How to map values of a Series according to an input correspondence:SSeries.map()

Understanding Series.transform(func[, axis])

Series.aggregate(func) : Pandas API on Spark

Series.agg(func) : Pandas API on Spark

Most Viewed Posts

Related Articles

Trending

Recent Posts

Featured Posts – Slider Widget