PySpark with Pandas API: How to generate a fixed-frequency DatetimeIndex with date_range()


In PySpark, the Pandas API offers powerful functionalities for working with time series data. One such function is date_range(), which generates a fixed frequency DatetimeIndex. This article provides an in-depth exploration of date_range(), covering its syntax, parameters, and practical applications through illustrative examples.

Understanding date_range()

The date_range() function in the Pandas API on Spark is used to generate a DatetimeIndex with a fixed frequency. It enables the creation of sequences of dates or timestamps, facilitating time series analysis, visualization, and manipulation tasks.

Syntax

The syntax for date_range() is as follows:

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)

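Here, start and end set the bounds of the range, periods sets the number of entries, and freq sets the spacing between them. Of these four, exactly three must be fixed; when only two of start, end, and periods are given, freq defaults to daily ('D'). tz sets the time zone, normalize snaps the start and end to midnight, name labels the resulting index, and closed controls whether the endpoints are included ('left', 'right', or None for both). In pandas 2.0 and later, closed has been replaced by inclusive. The pandas API on Spark exposes the same parameters through pyspark.pandas.date_range().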
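The parameters described above can be tried out in a few lines. The following is a minimal sketch using plain pandas; the variable names and the chosen dates are illustrative assumptions, not part of the original article.

```python
import pandas as pd

# start + end: freq is omitted, so it defaults to one entry per day
by_end = pd.date_range(start="2022-01-01", end="2022-01-05")

# start + periods + freq: three entries spaced two days apart
by_periods = pd.date_range(start="2022-01-01", periods=3, freq="2D")

# normalize=True snaps the start to midnight; name labels the index
normalized = pd.date_range(
    start="2022-01-01 13:45", periods=2, freq="D", normalize=True, name="date"
)

# tz attaches a time zone to every entry
localized = pd.date_range(
    start="2022-01-01", periods=2, freq="D", tz="America/New_York"
)

print(len(by_end))       # 5
print(by_periods[-1])    # 2022-01-05 00:00:00
print(normalized[0])     # 2022-01-01 00:00:00
print(localized.tz)      # America/New_York
```

The endpoint-control parameter is left out on purpose, because its name differs between pandas versions (closed before 2.0, inclusive afterwards).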

Examples

Let’s explore various scenarios to understand the functionality of date_range():

Example 1: Converting a Date Range to a Spark DataFrame

from pyspark.sql import SparkSession
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder \
    .appName("date_range_example @ Learning @ Freshers.in") \
    .getOrCreate()

# Generate a date range from January 1, 2022 to January 5, 2022
date_range = pd.date_range(start='2022-01-01', end='2022-01-05')

# Convert the pandas DatetimeIndex to a Spark DataFrame
df_pandas = pd.DataFrame(date_range, columns=['date'])
df_spark = spark.createDataFrame(df_pandas)

# Show the DataFrame
df_spark.show()

Output

+-------------------+
|               date|
+-------------------+
|2022-01-01 00:00:00|
|2022-01-02 00:00:00|
|2022-01-03 00:00:00|
|2022-01-04 00:00:00|
|2022-01-05 00:00:00|
+-------------------+

Example 2: Basic Date Range Generation

import pandas as pd
# Generate a date range from January 1, 2022 to January 5, 2022
date_index = pd.date_range(start='2022-01-01', end='2022-01-05')
print(date_index)

Output

DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05'],
              dtype='datetime64[ns]', freq='D')
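
Example 2 relies on the default daily frequency. As a hedged sketch of how the freq and periods parameters change the result, the snippet below uses the standard pandas offset aliases 'B' (business day) and 'W-MON' (weekly, anchored on Monday). The variable names are illustrative assumptions.

```python
import pandas as pd

# Business days only: 2022-01-01 is a Saturday, so the weekend is skipped
business_days = pd.date_range(start="2022-01-01", end="2022-01-07", freq="B")

# Weekly entries anchored on Mondays, starting from the first Monday on or after the start
mondays = pd.date_range(start="2022-01-01", periods=3, freq="W-MON")

# start + end + periods without freq: entries are spaced evenly between the bounds
evenly_spaced = pd.date_range(start="2022-01-01", end="2022-01-05", periods=3)

print(list(business_days.strftime("%Y-%m-%d")))
# ['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07']
print(list(mondays.strftime("%Y-%m-%d")))
# ['2022-01-03', '2022-01-10', '2022-01-17']
print(list(evenly_spaced.strftime("%Y-%m-%d")))
# ['2022-01-01', '2022-01-03', '2022-01-05']
```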

Example 3: Generating Timestamps with Specific Frequency

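import pandas as pd
# Generate one timestamp per hour for 3 days (72 periods)
# Lowercase 'h' is the hourly alias in current pandas; uppercase 'H' is deprecated since pandas 2.2
date_index = pd.date_range(start='2022-01-01', periods=72, freq='h')
print(date_index)

This call yields 72 hourly timestamps, from 2022-01-01 00:00:00 through 2022-01-03 23:00:00.

The output listed below does not come from this hourly call. It shows a daily, timezone-aware index covering February 2022, the kind of result date_range() returns when tz='America/New_York' and name='date' are passed:

Output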
DatetimeIndex(['2022-02-01 00:00:00-05:00', '2022-02-02 00:00:00-05:00',
               '2022-02-03 00:00:00-05:00', '2022-02-04 00:00:00-05:00',
               '2022-02-05 00:00:00-05:00', '2022-02-06 00:00:00-05:00',
               '2022-02-07 00:00:00-05:00', '2022-02-08 00:00:00-05:00',
               '2022-02-09 00:00:00-05:00', '2022-02-10 00:00:00-05:00',
               '2022-02-11 00:00:00-05:00', '2022-02-12 00:00:00-05:00',
               '2022-02-13 00:00:00-05:00', '2022-02-14 00:00:00-05:00',
               '2022-02-15 00:00:00-05:00', '2022-02-16 00:00:00-05:00',
               '2022-02-17 00:00:00-05:00', '2022-02-18 00:00:00-05:00',
               '2022-02-19 00:00:00-05:00', '2022-02-20 00:00:00-05:00',
               '2022-02-21 00:00:00-05:00', '2022-02-22 00:00:00-05:00',
               '2022-02-23 00:00:00-05:00', '2022-02-24 00:00:00-05:00',
               '2022-02-25 00:00:00-05:00', '2022-02-26 00:00:00-05:00',
               '2022-02-27 00:00:00-05:00', '2022-02-28 00:00:00-05:00'],
              dtype='datetime64[ns, America/New_York]', name='date', freq='D')
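
The listing above has dtype datetime64[ns, America/New_York], name 'date', and daily frequency over February 2022. The call that produced it is not shown in the article; the following is a sketch of one call that would build an index with those properties, with the start and end dates inferred from the listing.

```python
import pandas as pd

# Daily range over February 2022, localized to New York time and named "date"
# (arguments inferred from the listing above, not taken from the original code)
feb_index = pd.date_range(
    start="2022-02-01",
    end="2022-02-28",
    freq="D",
    tz="America/New_York",
    name="date",
)

print(len(feb_index))   # 28
print(feb_index[0])     # 2022-02-01 00:00:00-05:00
print(feb_index.name)   # date
```

Passing periods=28 instead of end would give the same 28 entries.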
Author: user