Accepting dates in a PySpark DataFrame
When you define data as a list of tuples and declare one of the columns as DateType in the schema, creating the DataFrame fails with an error such as: DateType can not accept object 'YYYY-MM-DD' in type <class 'str'>. The same thing happens with a timestamp field.
Consider the following data:
1,"Japan","2023-01-01"
2,"Italy","2023-01-01"
3,"France","2023-01-01"
We will load this into a DataFrame by specifying the schema explicitly:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, IntegerType, StringType
from pyspark.sql.types import StructField, StructType

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# The date is supplied as a plain string
car_data = [
    (1, "Japan", "2023-01-01"),
    (2, "Italy", "2023-01-01"),
    (3, "France", "2023-01-01"),
]
car_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("country_origin", StringType(), True),
    StructField("car_make_year", DateType(), True),
])
# Raises TypeError: a str is not accepted for a DateType field
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
This fails with the following error:
TypeError: field car_make_year: DateType can not accept object '2023-01-01' in type <class 'str'>
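How to solve this
The date, which is currently a string, must be converted to a Python date object before the DataFrame is created. DateType accepts datetime.date values, and also datetime.datetime values, since datetime is a subclass of date.
To make this concrete, the data needs to look like this: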
car_data = [
    (1, "Japan", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (2, "Italy", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (3, "France", datetime.strptime("2023-01-01", "%Y-%m-%d")),
]
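When the rows arrive as strings (for example from a file), calling strptime by hand on every tuple gets tedious. The following is a minimal sketch of converting them in one pass, assuming the dates are always in ISO YYYY-MM-DD form. The helper name to_date_rows and the variable raw_car_data are illustrative only and are not part of PySpark.

```python
from datetime import date

# Rows as they might arrive, with the date still a plain string
raw_car_data = [
    (1, "Japan", "2023-01-01"),
    (2, "Italy", "2023-01-01"),
    (3, "France", "2023-01-01"),
]


def to_date_rows(rows, date_index):
    """Return new tuples with the string at date_index parsed into datetime.date."""
    converted = []
    for row in rows:
        values = list(row)
        # date.fromisoformat parses 'YYYY-MM-DD' and raises ValueError on bad input
        values[date_index] = date.fromisoformat(values[date_index])
        converted.append(tuple(values))
    return converted


car_data = to_date_rows(raw_car_data, date_index=2)
```

The resulting car_data holds datetime.date values in the third position, so it should be usable with spark.createDataFrame and the same car_data_schema.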
Complete code
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, IntegerType, StringType
from pyspark.sql.types import StructField, StructType

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# Parse each date string into a datetime object, which DateType accepts
car_data = [
    (1, "Japan", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (2, "Italy", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (3, "France", datetime.strptime("2023-01-01", "%Y-%m-%d")),
]
car_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("country_origin", StringType(), True),
    StructField("car_make_year", DateType(), True),
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
car_df.printSchema()
car_df.show()
In the printSchema() output, car_make_year now has the data type date, and show() prints the rows:
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: date (nullable = true)
+-----+--------------+-------------+
|si_no|country_origin|car_make_year|
+-----+--------------+-------------+
|    1|         Japan|   2023-01-01|
|    2|         Italy|   2023-01-01|
|    3|        France|   2023-01-01|
+-----+--------------+-------------+
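The same approach applies to a timestamp field, where a string value is rejected in the same way. The following is a minimal sketch of preparing such rows, assuming the values use a 'YYYY-MM-DD HH:MM:SS' layout. The sample values and the names raw_events and TS_FORMAT are made up for illustration.

```python
from datetime import datetime

# Hypothetical rows whose third value is a timestamp string
raw_events = [
    (1, "Japan", "2023-01-01 10:30:00"),
    (2, "Italy", "2023-01-01 18:45:10"),
]

# Assumed layout of the timestamp strings
TS_FORMAT = "%Y-%m-%d %H:%M:%S"

# Parse each string into datetime.datetime, the Python type a timestamp field expects
events = [
    (si_no, country, datetime.strptime(ts, TS_FORMAT))
    for si_no, country, ts in raw_events
]
```

These rows can then be paired with a schema whose third field is declared as TimestampType() instead of DateType().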