PySpark how to find the date difference between two date and how to round it just days without decimal (datediff,floor)

PySpark @ Freshers.in

pyspark.sql.functions.datediff and pyspark.sql.functions.floor

In this article we will learn two function , mainly datediff and floor.

pyspark.sql.functions.datediff : To get difference between two dates in days in pyspark you can use datediff().

pyspark.sql.functions.floor : The column name is sent as a parameter to the function in PySpark, which rounds down the column and stores the results in a separate column as seen below.

Example : In this example you will see how to find the age of a person by giving the date of birth.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,DateType 
from pyspark.sql.functions import datediff
from pyspark.sql.functions import current_date
from pyspark.sql import functions as F
import datetime
cust_data = ([
(1,"Jack Willsmith",datetime.datetime.strptime('2000-02-01', "%Y-%m-%d").date()),\
(2,"Twinkle We",datetime.datetime.strptime('2001-03-12', "%Y-%m-%d").date()),\
(3,"Abil Wicky",datetime.datetime.strptime('2002-01-22', "%Y-%m-%d").date()),\
(4,"Kore Sam",datetime.datetime.strptime('2001-06-01', "%Y-%m-%d").date()),\
(5,"Bob Smith",datetime.datetime.strptime('2001-07-02', "%Y-%m-%d").date())])
cust_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("customer_name",StringType(),True),
StructField("customer_dob",DateType(),True)])
cust_data_df = spark.createDataFrame(data=cust_data,schema=cust_schema)
cust_data_df.printSchema()
cust_data_df.show()
cust_data_df.withColumn('age',F.floor(datediff(current_date(),cust_data_df["customer_dob"]))).show()
cust_data_df.withColumn('age',F.floor(datediff(current_date(),cust_data_df["customer_dob"])/365)).show()

Output

Source Schema

root
 |-- si_no: integer (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- customer_dob: date (nullable = true)

Raw DataFrame

+-----+--------------+------------+
|si_no| customer_name|customer_dob|
+-----+--------------+------------+
|    1|Jack Willsmith|  2000-02-01|
|    2|    Twinkle We|  2001-03-12|
|    3|    Abil Wicky|  2002-01-22|
|    4|      Kore Sam|  2001-06-01|
|    5|     Bob Smith|  2001-07-02|
+-----+--------------+------------+

Datediff with number of days : current date – customer_dob [ When I write this post its 03-09-2022 ]

+-----+--------------+------------+----+
|si_no| customer_name|customer_dob| age|
+-----+--------------+------------+----+
|    1|Jack Willsmith|  2000-02-01|8250|
|    2|    Twinkle We|  2001-03-12|7845|
|    3|    Abil Wicky|  2002-01-22|7529|
|    4|      Kore Sam|  2001-06-01|7764|
|    5|     Bob Smith|  2001-07-02|7733|
+-----+--------------+------------+----+

Calculation of age by dividing by 365

+-----+--------------+------------+---+
|si_no| customer_name|customer_dob|age|
+-----+--------------+------------+---+
|    1|Jack Willsmith|  2000-02-01| 22|
|    2|    Twinkle We|  2001-03-12| 21|
|    3|    Abil Wicky|  2002-01-22| 20|
|    4|      Kore Sam|  2001-06-01| 21|
|    5|     Bob Smith|  2001-07-02| 21|
+-----+--------------+------------+---+

Reference

  1. Spark Interview Questions
  2. Spark Examples
  3. PySpark Blogs
  4. Bigdata Blogs
  5. Official Page
Author: user

Leave a Reply