This article explains how to add or subtract a specific number of months from a date or timestamp in PySpark while handling end-of-month dates correctly. This is especially useful for financial, retail, or similar data, where month-end dates carry special meaning.
Setting up the Environment
Before we begin, we must set up our PySpark environment. Assuming you’ve installed Spark and PySpark properly, you should be able to initialize a SparkSession as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("freshers.in Learning Adding and Subtracting Months").getOrCreate()
Understanding add_months and date_add Functions
We will use PySpark’s built-in functions add_months and date_add/date_sub for these operations. The add_months function shifts a date by a given number of months and clamps the day-of-month to the last valid day of the target month when necessary: January 31 plus one month yields February 28, since February has no 31st. Note that this behavior changed in Spark 3.0; earlier versions additionally snapped a month-end input to the month-end of the result (2019-02-28 plus one month gave 2019-03-31 in Spark 2.4 but gives 2019-03-28 in Spark 3.0+). The outputs in this article reflect Spark 3.0+.
The date_add and date_sub functions, on the other hand, shift a date by a fixed number of days, which ignores month boundaries entirely: adding 30 days to January 31 lands on March 2, not anywhere near the end of February.
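To make the clamping rule concrete outside of Spark, here is a small pure-Python sketch of add_months-style month arithmetic (the Spark 3.0+ semantics described above), contrasted with plain day arithmetic:

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, n: int) -> date:
    """Shift d by n months, clamping the day to the target month's last day
    (mimics Spark 3.0+ add_months semantics)."""
    total = d.year * 12 + (d.month - 1) + n   # months since year 0
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

print(add_months(date(2023, 1, 31), 2))        # 2023-03-31 (31 fits in March)
print(add_months(date(2023, 2, 28), 2))        # 2023-04-28 (day kept, not snapped to 04-30)
print(add_months(date(2023, 1, 31), -2))       # 2022-11-30 (31 clamped to November's 30)
print(date(2023, 1, 31) + timedelta(days=30))  # 2023-03-02 (day arithmetic overshoots February)
```

Day arithmetic drifts past short months, which is exactly why month-aware functions exist.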
Using add_months Function
To demonstrate, let’s create a DataFrame with some sample dates:
from pyspark.sql import functions as F
from pyspark.sql.types import DateType
data = [("2023-01-31",), ("2023-02-28",), ("2023-07-15",)]
df = spark.createDataFrame(data, ["Date"])
df = df.withColumn("Date", F.col("Date").cast(DateType()))
df.show()
+----------+
| Date|
+----------+
|2023-01-31|
|2023-02-28|
|2023-07-15|
+----------+
Now, we will add two months to each date using add_months:
df.withColumn("New_Date", F.add_months(F.col("Date"), 2)).show()
+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2023-03-31|
|2023-02-28|2023-04-28|
|2023-07-15|2023-09-15|
+----------+----------+
Note how the day-of-month is handled: 2023-01-31 remains a month-end only because March also has 31 days, while 2023-02-28 becomes 2023-04-28 rather than 2023-04-30, since Spark 3.0+ keeps the original day number instead of snapping to the end of the target month.
Subtracting Months
Subtracting months is just as simple: pass a negative number as the second argument to add_months:
df.withColumn("New_Date", F.add_months(F.col("Date"), -2)).show()
+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2022-11-30|
|2023-02-28|2022-12-28|
|2023-07-15|2023-05-15|
+----------+----------+
Adding or Subtracting Months from a Timestamp
To work with timestamps instead of dates, we cast the column to TimestampType. Let’s create a new DataFrame to demonstrate:
from pyspark.sql.types import TimestampType
data = [("2023-01-31 13:45:30",), ("2023-02-28 08:20:00",), ("2023-07-15 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)
+-------------------+
|Timestamp |
+-------------------+
|2023-01-31 13:45:30|
|2023-02-28 08:20:00|
|2023-07-15 22:15:00|
+-------------------+
Then, we can add or subtract months as before. Be aware that add_months returns a DateType, so the time-of-day component is dropped, as the output below shows:
df.withColumn("New_Timestamp", F.add_months(F.col("Timestamp"), 2)).show(truncate=False)
df.withColumn("New_Timestamp", F.add_months(F.col("Timestamp"), -2)).show(truncate=False)
+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2023-03-31 |
|2023-02-28 08:20:00|2023-04-28 |
|2023-07-15 22:15:00|2023-09-15 |
+-------------------+-------------+
+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2022-11-30 |
|2023-02-28 08:20:00|2022-12-28 |
|2023-07-15 22:15:00|2023-05-15 |
+-------------------+-------------+