PySpark: How to read the date datatype from CSV?

PySpark @ Freshers.in

We set inferSchema = true when reading a CSV file, and Spark determines each column's data type from the values stored in it. However, Spark cannot always deduce date and timestamp fields from a CSV, so it reads those columns as strings instead. In this recipe we look at several approaches to solving this problem.

Here we explain both the DataFrame method and the Spark SQL way of converting a string column to the date datatype.

pyspark.sql.functions.to_date

to_date converts a Column into pyspark.sql.types.DateType, using the optionally supplied format. Formats should be specified using the date/time pattern syntax. Without a format, it follows the pyspark.sql.types casting conventions, which is similar to col.cast("date").

Sample code to show how to_date works

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql.types import StructType, StructField
spark = SparkSession.builder.appName('www.freshers.in training : to_date').getOrCreate()
car_data = [
    (1, "Japan", "2023-01-11"),
    (2, "Italy", "2023-04-21"),
    (3, "France", "2023-05-22"),
    (4, "India", "2023-07-18"),
    (5, "USA", "2023-08-23"),
]
car_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("country_origin", StringType(), True),
    StructField("car_make_year", StringType(), True)
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
car_df.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- country_origin: string (nullable = true)
 |-- car_make_year: string (nullable = true)

Applying the to_date function

car_df_updated = car_df.withColumn("car_make_year_dt",to_date("car_make_year"))
car_df_updated.show()
+-----+--------------+-------------+----------------+
|si_no|country_origin|car_make_year|car_make_year_dt|
+-----+--------------+-------------+----------------+
|    1|         Japan|   2023-01-11|      2023-01-11|
|    2|         Italy|   2023-04-21|      2023-04-21|
|    3|        France|   2023-05-22|      2023-05-22|
|    4|         India|   2023-07-18|      2023-07-18|
|    5|           USA|   2023-08-23|      2023-08-23|
+-----+--------------+-------------+----------------+

Check the schema that gets printed: you can see the date datatype for the new column car_make_year_dt.

car_df_updated.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- country_origin: string (nullable = true)
 |-- car_make_year: string (nullable = true)
 |-- car_make_year_dt: date (nullable = true)

The above can be done the SQL way, as follows, by creating a temp view.

car_df.createOrReplaceTempView("car_table")
spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").show()
+-----+--------------+----------------------------------+
|si_no|country_origin|to_date(car_table.`car_make_year`)|
+-----+--------------+----------------------------------+
|    1|         Japan|                        2023-01-11|
|    2|         Italy|                        2023-04-21|
|    3|        France|                        2023-05-22|
|    4|         India|                        2023-07-18|
|    5|           USA|                        2023-08-23|
+-----+--------------+----------------------------------+

To check the schema:

spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- country_origin: string (nullable = true)
 |-- to_date(car_table.`car_make_year`): date (nullable = true)
