PySpark : Skipping Sundays in Date Computations

PySpark @ Freshers.in

When working with data in fields such as finance or certain business operations, it’s often the case that weekends or specific days of the week, such as Sundays, are considered non-working days or holidays. In these situations, you might need to compute the next business day from a given date or timestamp, excluding these non-working days. This article will walk you through the process of accomplishing this task using PySpark, the Python library for Apache Spark. We’ll provide a detailed example to ensure a clear understanding of this operation.

Setting Up the Environment

Firstly, we need to set up our PySpark environment. Assuming you’ve properly installed Spark and PySpark, you can initialize a SparkSession as follows:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Freshers.in Learning @ Skipping Sundays in Date Computations") \
    .getOrCreate()

Understanding date_add, date_format Functions and Conditional Statements

The functions we’ll be using in this tutorial are PySpark’s built-in date_add and date_format functions, along with the when function for conditional logic. The date_add function adds a number of days to a date or timestamp, while the date_format function converts a date or timestamp to a string based on a given format. The when function allows us to create a new column based on conditional logic.

Creating a DataFrame with Timestamps

Let’s start by creating a DataFrame that contains some sample timestamps:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
data = [("2023-01-14 13:45:30",), ("2023-02-25 08:20:00",), ("2023-07-07 22:15:00",), ("2023-07-08 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)
+-------------------+
|Timestamp          |
+-------------------+
|2023-01-14 13:45:30|
|2023-02-25 08:20:00|
|2023-07-07 22:15:00|
|2023-07-08 22:15:00|
+-------------------+

Getting the Next Day Excluding Sundays

To get the next day from each timestamp, excluding Sundays, we first use the date_add function to compute the next day. Then we use date_format to get the day of the week. If this day is a Sunday, we use date_add again to get the following day:

df = df.withColumn("Next_Day", F.date_add(F.col("Timestamp"), 1))
df = df.withColumn("Next_Day", 
                   F.when(F.date_format(F.col("Next_Day"), "EEEE") == "Sunday", 
                          F.date_add(F.col("Next_Day"), 1))
                   .otherwise(F.col("Next_Day")))
df.show(truncate=False)

ResultĀ 

+-------------------+----------+
|Timestamp          |Next_Day  |
+-------------------+----------+
|2023-01-14 13:45:30|2023-01-16|
|2023-02-25 08:20:00|2023-02-27|
|2023-07-07 22:15:00|2023-07-08|
|2023-07-08 22:15:00|2023-07-10|
+-------------------+----------+

In the Next_Day column, you’ll see that if the next day would have been a Sunday, it has been replaced with the following Monday.

The use of date_add, date_format, and conditional logic with when function enables us to easily compute the next business day from a given date or timestamp, while excluding non-working days like Sundays.

Spark important urls to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page
Author: user

Leave a Reply