PySpark : from_utc_timestamp Function: A Detailed Guide


The from_utc_timestamp function in PySpark converts a UTC timestamp to a specified timezone. This conversion is essential when you’re dealing with data that spans multiple time zones. In this article, we take a deep dive into the function, exploring its syntax and use cases, and providing examples for a better understanding.

Syntax

The function from_utc_timestamp accepts two parameters:

1. The timestamp to convert from UTC.

2. The string that represents the timezone to convert to.

The syntax is as follows:

from pyspark.sql.functions import from_utc_timestamp
from_utc_timestamp(timestamp, tz)

Use-Case Scenario

Imagine you’re a data analyst at a global company that receives sales data from different regions around the world. The data includes the timestamp of each transaction, stored in UTC. For your analysis, however, you need to convert these timestamps into local times to get an accurate picture of customer behavior during local hours. This is where the from_utc_timestamp function comes into play.

Detailed Examples

First, let’s start by creating a PySpark session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Learning @ Freshers.in from_utc_timestamp').getOrCreate()

Let’s assume we have a data frame with sales data, which includes a timestamp column with UTC times. We’ll use hardcoded values for simplicity:

from pyspark.sql.types import TimestampType
data = [("1", "2023-01-01 13:30:00"), 
        ("2", "2023-02-01 14:00:00"), 
        ("3", "2023-03-01 15:00:00")]
df = spark.createDataFrame(data, ["sale_id", "timestamp"])
# Cast the timestamp column to timestamp type
df = df.withColumn("timestamp", df["timestamp"].cast(TimestampType()))

Now, our data frame has a ‘timestamp’ column with UTC times. Let’s convert these to New York time using the from_utc_timestamp function:

from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn("NY_time", from_utc_timestamp(df["timestamp"], "America/New_York"))
df.show(truncate=False)
Output
+-------+-------------------+-------------------+
|sale_id|timestamp          |NY_time            |
+-------+-------------------+-------------------+
|1      |2023-01-01 13:30:00|2023-01-01 08:30:00|
|2      |2023-02-01 14:00:00|2023-02-01 09:00:00|
|3      |2023-03-01 15:00:00|2023-03-01 10:00:00|
+-------+-------------------+-------------------+

As you can see, the from_utc_timestamp function correctly converted the UTC times to New York local times (UTC-5 in these months). Because America/New_York is an IANA timezone ID, daylight saving time is handled automatically.

Remember that PySpark accepts the standard IANA timezone IDs (the same IDs used by Java and Python). To list the available timezones, you can use the pytz library:

import pytz
for tz in pytz.all_timezones:
    print(tz)
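As a quick sanity check outside of Spark, the same conversion can be reproduced in plain Python with the standard-library zoneinfo module (Python 3.9+). This is only a cross-check of the first row from the example above, not part of the Spark job:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The first sale's UTC timestamp from the example above
utc_ts = datetime(2023, 1, 1, 13, 30, tzinfo=timezone.utc)

# Convert to New York local time, mirroring from_utc_timestamp
ny_ts = utc_ts.astimezone(ZoneInfo("America/New_York"))
print(ny_ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2023-01-01 08:30:00
```

Matching the Spark output confirms that the New York offset (UTC-5 in January) was applied correctly.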