PySpark: Getting the Next and Previous Day from a Timestamp

PySpark @ Freshers.in

Data processing and analysis often require the day after or the day before a given date or timestamp. This article shows how to compute both in PySpark, the Python API for Apache Spark, with a worked example for each operation.

Setting Up the Environment

First, we need to set up the PySpark environment. Assuming Spark and PySpark are installed, you can initialize a SparkSession as follows:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Freshers.in Learning @ Next Day and Previous Day") \
    .getOrCreate()

Creating a DataFrame with Timestamps

Let’s start by creating a DataFrame containing some sample timestamps:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
# Sample timestamps as strings, one per row
data = [("2023-01-15 13:45:30",), ("2023-02-22 08:20:00",), ("2023-07-07 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
# Cast the string column to a timestamp column
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)
+-------------------+
|Timestamp          |
+-------------------+
|2023-01-15 13:45:30|
|2023-02-22 08:20:00|
|2023-07-07 22:15:00|
+-------------------+

Getting the Next Day

To get the next day from each timestamp, we use the date_add function, passing in the timestamp column and the number 1 to add one day. Note that date_add returns a date, so the time of day is dropped from the result:

df.withColumn("Next_Day", F.date_add(F.col("Timestamp"), 1)).show(truncate=False)
+-------------------+----------+
|Timestamp          |Next_Day  |
+-------------------+----------+
|2023-01-15 13:45:30|2023-01-16|
|2023-02-22 08:20:00|2023-02-23|
|2023-07-07 22:15:00|2023-07-08|
+-------------------+----------+

The Next_Day column shows the date of the day after each timestamp.

Getting the Previous Day

To get the previous day, we use the date_sub function, again passing in the timestamp column and the number 1 to indicate that we want to subtract one day:

df.withColumn("Previous_Day", F.date_sub(F.col("Timestamp"), 1)).show(truncate=False)
+-------------------+------------+
|Timestamp          |Previous_Day|
+-------------------+------------+
|2023-01-15 13:45:30|2023-01-14  |
|2023-02-22 08:20:00|2023-02-21  |
|2023-07-07 22:15:00|2023-07-06  |
+-------------------+------------+

The Previous_Day column shows the date of the day before each timestamp.

PySpark's built-in date functions cover this task directly. The date_add and date_sub functions compute the next day and the previous day from a given date or timestamp column, returning the result as a date.

Author: user