Extracting minutes from timestamp in Google BigQuery and handling in PySpark

Google Big Query @ Freshers.in

Often in data analytics, there’s a need to extract specific parts of a date or timestamp for more granular analysis. One such common operation is extracting minutes from a timestamp. This tutorial focuses on achieving this in Google BigQuery and processing the results with PySpark.

Step 1: Setting up the Table in BigQuery

We’ll first create a table named freshers_in_page_views in BigQuery with sample data.

CREATE TABLE freshers_in_page_views (
    view_id INT64,
    view_timestamp TIMESTAMP,
    view_count INT64
);

Step 2: Inserting Sample Data

Let’s insert some rows of sample data for demonstration.

Insert Statement:
INSERT INTO freshers_in_page_views (view_id, view_timestamp, view_count)
VALUES
(1, TIMESTAMP "2023-09-14 15:43:22 UTC", 120),
(2, TIMESTAMP "2023-09-14 12:05:45 UTC", 89),
(3, TIMESTAMP "2023-09-14 09:15:00 UTC", 155);

Step 3: Extracting Minutes from Timestamp in BigQuery

Using the EXTRACT function in BigQuery, we can retrieve the minutes of our timestamps.

Select Statement:
SELECT
    view_id,
    EXTRACT(MINUTE FROM view_timestamp) AS minute_of_hour,
    view_count
FROM freshers_in_page_views;
Output
view_id | minute_of_hour | view_count
------- | -------------- | ----------
1       | 43             | 120
2       | 5              | 89
3       | 15             | 155
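To see why these values come out as they do, you can sanity-check EXTRACT(MINUTE ...) semantics in plain Python: the minute component of a timestamp is simply the `MM` part of `HH:MM:SS`, with no zero-padding in the result (note `5`, not `05`, for the second row). This is a standalone check using only the standard library, not part of the BigQuery workflow itself.

```python
from datetime import datetime

# The same three timestamps inserted into freshers_in_page_views
timestamps = [
    "2023-09-14 15:43:22",
    "2023-09-14 12:05:45",
    "2023-09-14 09:15:00",
]

# Parse each string and pull out the minute component,
# mirroring EXTRACT(MINUTE FROM view_timestamp) in BigQuery
minutes = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").minute for ts in timestamps]
print(minutes)  # [43, 5, 15]
```

The values match the BigQuery output column minute_of_hour above.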

Step 4: Processing the Result in PySpark

Once you’ve exported the results from BigQuery, you can utilize PySpark for further processing.

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("bigquery-pyspark @ Learning at Freshers.in").getOrCreate()

# Load the data (assuming a CSV export from BigQuery)
df = spark.read.csv("path_to_your_exported_file.csv", header=True, inferSchema=True)

# Display the dataframe with the extracted minute column
df.show()
