PySpark – How to convert string date to Date datatype


pyspark.sql.functions.to_date

This article gives a brief overview of how to convert a string date column to the Date datatype. With pyspark.sql.functions.to_date, a column is transformed into pyspark.sql.types.DateType using an optionally specified format, given as a date/time pattern (for example dd-MM-yyyy). If the format is omitted, it follows the pyspark.sql.types casting rules, which is equivalent to col.cast("date").

Sample code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import to_date

# Create the SparkSession before calling createDataFrame
spark = SparkSession.builder.appName("to_date_example").getOrCreate()

cust_data = [
    (1, "Tim John", "05-12-2000"),
    (2, "Catherine Walter", "21-02-1999"),
    (3, "John Wincent", "15-03-2001"),
    (4, "Barbara Peter", "11-04-2000"),
    (5, "Arron Smith", "28-05-2000")]
cust_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_dob", StringType(), True)])
cust_data_df = spark.createDataFrame(data=cust_data, schema=cust_schema)
cust_data_df.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- customer_dob: string (nullable = true)
cust_data_df.show()
+-----+----------------+------------+
|si_no|   customer_name|customer_dob|
+-----+----------------+------------+
|    1|        Tim John|  05-12-2000|
|    2|Catherine Walter|  21-02-1999|
|    3|    John Wincent|  15-03-2001|
|    4|   Barbara Peter|  11-04-2000|
|    5|     Arron Smith|  28-05-2000|
+-----+----------------+------------+
cust_data_df2 = cust_data_df.select(
    cust_data_df["si_no"],
    cust_data_df["customer_name"],
    cust_data_df["customer_dob"],
    to_date(cust_data_df["customer_dob"], "dd-MM-yyyy").alias("dob_dt"))
cust_data_df2.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- customer_dob: string (nullable = true)
 |-- dob_dt: date (nullable = true)
cust_data_df2.show()
+-----+----------------+------------+----------+
|si_no|   customer_name|customer_dob|    dob_dt|
+-----+----------------+------------+----------+
|    1|        Tim John|  05-12-2000|2000-12-05|
|    2|Catherine Walter|  21-02-1999|1999-02-21|
|    3|    John Wincent|  15-03-2001|2001-03-15|
|    4|   Barbara Peter|  11-04-2000|2000-04-11|
|    5|     Arron Smith|  28-05-2000|2000-05-28|
+-----+----------------+------------+----------+

Comparing the printSchema output of cust_data_df and cust_data_df2 above, you can see the difference in datatype: customer_dob remains a string while dob_dt is a date. There are many situations, such as date arithmetic, where you will need this conversion.

If the source contains a timestamp and you want to skip the time portion, you can include the time fields in the pattern, for example "dd-MM-yyyy HH:mm:ss", and still use to_date for the conversion.

Example

cust_data_df2 = cust_data_df.select(
    cust_data_df["si_no"],
    cust_data_df["customer_name"],
    cust_data_df["customer_dob"],
    to_date(cust_data_df["customer_dob"], "dd-MM-yyyy HH:mm:ss").alias("dob_dt"))

Keep in mind that the above still converts to a date only, not to a timestamp; the time portion is discarded.
