PySpark : unix_timestamp function – A comprehensive guide

PySpark @ Freshers.in

One of the key functionalities of PySpark is the ability to transform data into the desired format. In some cases, it is necessary to convert a date or timestamp into a numerical format for further analysis. The unix_timestamp() function converts a date or timestamp to a Unix timestamp, which is the number of seconds elapsed since 1970-01-01 00:00:00 UTC.

In this article, we will discuss the unix_timestamp() function in PySpark in detail, including its syntax, parameters, and examples.

Syntax
The unix_timestamp() function in PySpark has the following syntax:

unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

The timestamp parameter is the date or timestamp that you want to convert to Unix timestamp. The format parameter is an optional parameter that specifies the format of the input timestamp. If the format parameter is not specified, PySpark will use the default format, which is “yyyy-MM-dd HH:mm:ss”.

Parameters
The unix_timestamp() function in PySpark takes two parameters:

timestamp: This is the date or timestamp column that you want to convert to a Unix timestamp. It can be a Column object or a string holding a column name; a plain string is interpreted as a column name, not as a literal timestamp value. If it is omitted, the function returns the current Unix timestamp.
format: This is an optional parameter that specifies the format of the input timestamp. The format should be a string that conforms to the date format pattern syntax, for example “yyyy-MM-dd”. If this parameter is not specified, PySpark will use the default format, which is “yyyy-MM-dd HH:mm:ss”.
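
To make the default pattern concrete, here is a minimal plain-Python sketch (not Spark code) that parses a string laid out as “yyyy-MM-dd HH:mm:ss” and turns it into epoch seconds. The strptime directive string is assumed to be the equivalent of the Spark pattern, the input is treated as UTC purely for illustration, and to_epoch_seconds is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Assumed strptime equivalent of the Spark pattern "yyyy-MM-dd HH:mm:ss"
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_seconds(text, fmt=DEFAULT_FORMAT, tz=timezone.utc):
    """Parse text with fmt, attach tz, and return whole seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=tz)
    return int(parsed.timestamp())


print(to_epoch_seconds("2022-03-24 12:30:00"))  # 1648125000

# A string that does not follow the pattern cannot be parsed
try:
    to_epoch_seconds("2022-03-24")
except ValueError as exc:
    print("pattern mismatch:", exc)
```

The second call fails because the input has no time part, which is why a date-only column needs an explicit format such as “yyyy-MM-dd”.
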
Examples
Let’s look at some examples to understand how the unix_timestamp() function works in PySpark.

Example 1: Converting a Timestamp to Unix Timestamp
Suppose we have the date “2023-04-08” in a DataFrame column named “dt” and we want to convert it to a Unix timestamp. We can use the unix_timestamp() function to do this as follows:

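from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
spark = SparkSession.builder.appName("UnixTimestamp @Freshers.in").getOrCreate()
# The session time zone decides which instant a zone-less string refers to
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
time_df = spark.createDataFrame([('2023-04-08',)], ['dt'])
# Parse column 'dt' with the pattern 'yyyy-MM-dd' and return epoch seconds
time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()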
In this example, we pass the column name “dt” and the pattern “yyyy-MM-dd” to the unix_timestamp() function. The function returns the Unix timestamp of midnight on that date in the session time zone.

Output:

[Row(unix_time=1680937200)]
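
The value 1680937200 can be cross-checked outside Spark. The following plain-Python sketch assumes that America/Los_Angeles is on daylight time (UTC-7) on 2023-04-08 and uses a fixed offset instead of a time zone database; it also shows how the result would shift if the same date were read as UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Los Angeles observes daylight time (UTC-7) on 2023-04-08
PDT = timezone(timedelta(hours=-7))

midnight_la = datetime(2023, 4, 8, tzinfo=PDT)
midnight_utc = datetime(2023, 4, 8, tzinfo=timezone.utc)

la_seconds = int(midnight_la.timestamp())
utc_seconds = int(midnight_utc.timestamp())

print(la_seconds)                # 1680937200, the value from the output above
print(utc_seconds)               # 1680912000
print(la_seconds - utc_seconds)  # 25200 seconds, i.e. 7 hours
```

The same input string maps to different Unix timestamps depending on the time zone used to interpret it, which is why the example sets spark.sql.session.timeZone before parsing.
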

Example 2: Converting a Timestamp with a Custom Format to Unix Timestamp
Suppose the input does not follow the default “yyyy-MM-dd HH:mm:ss” layout, as with the date-only string “2023-04-08”. We can store the matching pattern in a variable and pass it as the format parameter of the unix_timestamp() function as follows:

from pyspark.sql.functions import unix_timestamp
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
time_df = spark.createDataFrame([('2023-04-08',)], ['dt'])
date_format = "yyyy-MM-dd"  # pattern matching the date-only strings in 'dt'
time_df.select(unix_timestamp('dt', date_format).alias('unix_time')).collect()

Output:

[Row(unix_time=1680937200)]

In this example, we pass the timestamp and format parameters to the unix_timestamp() function. The function returns the Unix timestamp of the input timestamp using the specified format.
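
For a layout with a 12-hour clock such as “03-24-2022 12:30:00 PM”, the pattern has to describe the month-first order and the AM/PM marker. The sketch below illustrates this in plain Python rather than Spark, assuming Los Angeles is at UTC-7 on that date; in Spark the corresponding pattern would presumably be “MM-dd-yyyy hh:mm:ss a”, which is an assumption and not taken from the example above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Los Angeles observes daylight time (UTC-7) on 2022-03-24
PDT = timezone(timedelta(hours=-7))

raw = "03-24-2022 12:30:00 PM"
# %m-%d-%Y is month-day-year, %I is the 12-hour clock, %p is the AM/PM marker
parsed = datetime.strptime(raw, "%m-%d-%Y %I:%M:%S %p").replace(tzinfo=PDT)
custom_seconds = int(parsed.timestamp())

print(parsed.isoformat())  # 2022-03-24T12:30:00-07:00
print(custom_seconds)      # 1648150200
```
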

Example 3: Converting a Timestamp Column to Unix Timestamp Column in a DataFrame
Suppose we have a DataFrame that contains a timestamp column “timestamp” that we want to convert to a Unix timestamp column. We can use the unix_timestamp() function with the col() function to do this as follows:

from pyspark.sql.functions import unix_timestamp, col
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("UnixTimestamp @Freshers.in ").getOrCreate()
df = spark.createDataFrame([
    ("2023-04-24 12:30:00",),
    ("2023-04-24 13:30:00",),
    ("2023-04-24 14:30:00",),
    ("2023-04-24 15:30:00",)
], ["timestamp"])
df = df.withColumn("unix_timestamp", unix_timestamp(col("timestamp")))
df.show()

Output

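+-------------------+--------------+
|          timestamp|unix_timestamp|
+-------------------+--------------+
|2023-04-24 12:30:00|    1682364600|
|2023-04-24 13:30:00|    1682368200|
|2023-04-24 14:30:00|    1682371800|
|2023-04-24 15:30:00|    1682375400|
+-------------------+--------------+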

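In this example, we first create a DataFrame with a timestamp column “timestamp”. We then use the withColumn() function to add a new column “unix_timestamp” to the DataFrame, which contains the Unix timestamp of the “timestamp” column. We use the col() function to refer to the “timestamp” column in the DataFrame. Because no format is passed, the default “yyyy-MM-dd HH:mm:ss” is used. The values shown correspond to a session time zone of America/Los_Angeles, as set in the earlier examples; a different spark.sql.session.timeZone setting produces different numbers.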
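
As a sanity check on the table, the Unix timestamps can be converted back to wall-clock strings. This is a plain-Python sketch assuming a fixed UTC-7 offset for Los Angeles on 2023-04-24; within Spark, the from_unixtime() function plays the corresponding role.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Los Angeles observes daylight time (UTC-7) on 2023-04-24
PDT = timezone(timedelta(hours=-7))

unix_values = [1682364600, 1682368200, 1682371800, 1682375400]

# Convert each epoch value back to a local "yyyy-MM-dd HH:mm:ss" style string
local_strings = [
    datetime.fromtimestamp(value, tz=PDT).strftime("%Y-%m-%d %H:%M:%S")
    for value in unix_values
]

for value, text in zip(unix_values, local_strings):
    print(value, text)
```

Each row is exactly 3600 seconds after the previous one, matching the one-hour steps in the “timestamp” column.
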

The unix_timestamp() function is a useful PySpark function for converting a date or timestamp to a Unix timestamp. In this article, we discussed the syntax and parameters of the unix_timestamp() function and walked through examples of how to use it with an explicit format, with a format held in a variable, and on a DataFrame column.

Author: user
