PySpark : Subtracting a specified number of days from a given date in PySpark [date_sub]

In this article, we will delve into the date_sub function in PySpark. This versatile function allows us to subtract a specified number of days from a given date, enabling us to perform date-based operations and gain valuable insights from our data.

from pyspark.sql.functions import date_sub

Understanding date_sub:

The date_sub function in PySpark facilitates date subtraction by subtracting a specified number of days from a given date. It helps us analyze historical data, calculate intervals, and perform various time-based computations within our Spark applications.

Syntax:
The syntax for using date_sub in PySpark is as follows:

date_sub(start_date, days)

Here, start_date is the date column (or column name) from which we want to subtract days, and days is the number of days to subtract, given as an integer (recent Spark releases also accept a column for days).
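
For example, a minimal call might look like the sketch below (the DataFrame df and its "order_date" column are hypothetical placeholders; any column holding dates, or date strings in yyyy-MM-dd format, will work):

from pyspark.sql.functions import col, date_sub

# Subtract 3 days from the hypothetical "order_date" column and
# expose the result as a new "cutoff_date" column
df.select(
    col("order_date"),
    date_sub(col("order_date"), 3).alias("cutoff_date")
).show()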

Example Usage:
To illustrate the usage of date_sub in PySpark, let’s consider a scenario where we have a dataset containing sales records. We want to analyze sales data from the past 7 days.

Step 1: Importing the necessary libraries and creating a SparkSession.

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_sub

# Create a SparkSession
spark = SparkSession.builder \
    .appName("date_sub Example at Freshers.in") \
    .getOrCreate()

Step 2: Creating a sample DataFrame with hardcoded values.

# Sample DataFrame with hardcoded values
data = [("Product A", "2023-05-15", 100),
        ("Product B", "2023-05-16", 150),
        ("Product C", "2023-05-17", 200),
        ("Product D", "2023-05-18", 120),
        ("Product E", "2023-05-19", 90),
        ("Product F", "2023-05-20", 180),
        ("Product G", "2023-05-21", 210),
        ("Product H", "2023-05-22", 160)]
df = spark.createDataFrame(data, ["Product", "Date", "Sales"])

# Show the initial DataFrame
df.show()

Result

+---------+----------+-----+
|  Product|      Date|Sales|
+---------+----------+-----+
|Product A|2023-05-15|  100|
|Product B|2023-05-16|  150|
|Product C|2023-05-17|  200|
|Product D|2023-05-18|  120|
|Product E|2023-05-19|   90|
|Product F|2023-05-20|  180|
|Product G|2023-05-21|  210|
|Product H|2023-05-22|  160|
+---------+----------+-----+

Step 3: Subtracting days using date_sub.

# Subtract 7 days from each value in the Date column
df_subtracted = df.withColumn("SubtractedDate", date_sub(df.Date, 7))

# Show the resulting DataFrame
df_subtracted.show()

Result

+---------+----------+-----+--------------+
|  Product|      Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product A|2023-05-15|  100|    2023-05-08|
|Product B|2023-05-16|  150|    2023-05-09|
|Product C|2023-05-17|  200|    2023-05-10|
|Product D|2023-05-18|  120|    2023-05-11|
|Product E|2023-05-19|   90|    2023-05-12|
|Product F|2023-05-20|  180|    2023-05-13|
|Product G|2023-05-21|  210|    2023-05-14|
|Product H|2023-05-22|  160|    2023-05-15|
+---------+----------+-----+--------------+

In the above code snippet, we used the `date_sub` function to subtract 7 days from the “Date” column in the DataFrame. The resulting column, “SubtractedDate,” contains the dates obtained after subtracting 7 days.
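
The same subtraction can also be expressed as a Spark SQL snippet via expr, since date_sub is available as a built-in SQL function too. The sketch below is equivalent to the withColumn call above (note that date_sub accepts the string "Date" column here because its values are in the default yyyy-MM-dd format):

from pyspark.sql.functions import expr

# Equivalent subtraction written as a SQL expression; produces the same "SubtractedDate" column
df_subtracted_sql = df.withColumn("SubtractedDate", expr("date_sub(Date, 7)"))
df_subtracted_sql.show()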

Step 4: Filtering data based on the subtracted date.

# Filter for recent sales: keep rows whose subtracted date is on or after 2023-05-15
recent_sales = df_subtracted.filter(df_subtracted.SubtractedDate >= '2023-05-15')

# Show the filtered DataFrame
recent_sales.show()

Result

+---------+----------+-----+--------------+
|  Product|      Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product H|2023-05-22|  160|    2023-05-15|
+---------+----------+-----+--------------+

By filtering the DataFrame on the "SubtractedDate" column, we kept only the records whose subtracted date was greater than or equal to '2023-05-15', i.e., sales whose original date falls within the 7 days ending on 2023-05-22.
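
In practice, a "past 7 days" window is usually anchored to today's date rather than a hardcoded literal. Below is a minimal sketch of that variant, combining date_sub with current_date and reusing the df DataFrame from the example above:

from pyspark.sql.functions import col, current_date, date_sub, to_date

# Keep rows whose Date (parsed from its yyyy-MM-dd string) falls within
# the 7 days leading up to the current date
recent_sales_dynamic = df.filter(to_date(col("Date")) >= date_sub(current_date(), 7))
recent_sales_dynamic.show()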

Here we explored the functionality of PySpark’s date_sub function, which allows us to subtract a specified number of days from a given date. By incorporating this powerful function into our PySpark workflows, we can perform date-based operations, analyze historical data, and gain valuable insights from our datasets. Whether it’s calculating intervals, filtering data based on specific timeframes, or performing time-based computations, the date_sub function proves to be an invaluable tool for date subtraction in PySpark applications.
