pyspark.sql.functions.date_trunc(format, timestamp)
Truncation function offered by Spark Dateframe SQL functions is date_trunc(), which returns Date in the format “yyyy-MM-dd HH:mm:ss.SSSS” and truncates at Year, Month, Day, Hour, Minute, and Seconds units. This will returns timestamp truncated to the unit specified by the format.
from pyspark.sql.functions import date_trunc
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("date_trunc_example").getOrCreate()
# Create a dataframe with a single column of timestamps
df = spark.createDataFrame([
("2023-01-01 10:10:10",),
("2023-02-01 11:11:11",),
("2023-03-01 12:12:12",),
("2023-03-31 11:11:11",),
("2023-01-31 11:11:11",),
("2023-02-01 11:11:11",)], ["timestamp"])
# Round down each timestamp to the nearest month
df_truncated = df.withColumn("month", date_trunc("month", df["timestamp"]))
df_truncated.show(20,False)
+-------------------+-------------------+
| timestamp| month|
+-------------------+-------------------+
|2023-01-01 10:10:10|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
|2023-03-01 12:12:12|2023-03-01 00:00:00|
|2023-03-31 11:11:11|2023-03-01 00:00:00|
|2023-01-31 11:11:11|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
+-------------------+-------------------+
In this example, we use the date_trunc function to round down the timestamps in the “timestamp” column to the nearest month. The output is a new dataframe with two columns: the original “timestamp” column and a new column “month” which contains the rounded-down timestamps.
pyspark.sql.functions.trunc
The trunc function in PySpark can also be used to truncate date or timestamp values to a specified unit of time (e.g. year, quarter, month, day, etc.).
Here’s an example of using trunc on date or timestamp values in PySpark:
from pyspark.sql.functions import trunc
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("trunc example").getOrCreate()
# Create a dataframe with a single column of timestamps
df = spark.createDataFrame([
("2023-01-01 10:10:10",),
("2023-02-01 11:11:11",),
("2023-03-01 12:12:12",),
("2023-03-31 11:11:11",),
("2023-01-31 11:11:11",),
("2023-02-01 11:11:11",)], ["timestamp"])
df.select(df.timestamp,trunc(df.timestamp, 'year').alias('year')).show()
+-------------------+----------+
| timestamp| year|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
|2023-03-01 12:12:12|2023-01-01|
|2023-03-31 11:11:11|2023-01-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
+-------------------+----------+
df.select(df.timestamp,trunc(df.timestamp, 'month').alias('month')).show()
+-------------------+----------+
| timestamp| month|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
|2023-03-01 12:12:12|2023-03-01|
|2023-03-31 11:11:11|2023-03-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
+-------------------+----------+
df.select(df.timestamp,trunc(df.timestamp, 'day').alias('day')).show()
+-------------------+----+
| timestamp| day|
+-------------------+----+
|2023-01-01 10:10:10|null|
|2023-02-01 11:11:11|null|
|2023-03-01 12:12:12|null|
|2023-03-31 11:11:11|null|
|2023-01-31 11:11:11|null|
|2023-02-01 11:11:11|null|
+-------------------+----+
Spark important urls to refer