Window functions in Google BigQuery: Analyzing data over specific intervals

Google Big Query @ Freshers.in

Window functions, often termed as analytical or OLAP (Online Analytical Processing) functions, allow users to perform calculations across a set of table rows related to the current row. Unlike aggregate functions, which return a single result per group of rows, window functions return a single result for each row, based on the related rows. This gives you the ability to conduct more complex analyses, especially when you need to maintain the granularity of the data.

Use cases and application of specific functions

Window functions are versatile and offer a multitude of ways to dissect and analyze your data.

CREATE TABLE freshers_in_most_visited_pages (
    id INT64,
    page_name STRING,
    popularity INT64
);

Inserting some sample data into this table:

INSERT INTO freshers_in_most_visited_pages (id, page_name, popularity)
VALUES
(1, 'Home', 100),
(2, 'About', 75),
(3, 'Services', 50),
(4, 'Contact', 60),
(5, 'Blog', 85);

ROW_NUMBER()

Assigns a unique number to each row within the partition based on the order specified.

Use Case: Numbering pages based on their popularity.

SELECT page_name,
       ROW_NUMBER() OVER (ORDER BY popularity DESC) AS popularity_rank
FROM freshers_in_most_visited_pages;

Output:

+-----------+----------------+
| page_name | popularity_rank|
+-----------+----------------+
| Home      | 1              |
| Blog      | 2              |
| About     | 3              |
| Contact   | 4              |
| Services  | 5              |
+-----------+----------------+

LEAD()

Allows you to look ahead by a specified number of rows within the same result set.

Use Case: Comparing the popularity of a page with the next most popular page.

SELECT page_name,
       popularity,
       LEAD(popularity) OVER (ORDER BY popularity DESC) AS next_popular_page_visits
FROM freshers_in_most_visited_pages;

Output:

+-----------+-----------+--------------------------+
| page_name | popularity| next_popular_page_visits |
+-----------+-----------+--------------------------+
| Home      | 100       | 85                       |
| Blog      | 85        | 75                       |
| About     | 75        | 60                       |
| Contact   | 60        | 50                       |
| Services  | 50        | NULL                     |
+-----------+-----------+--------------------------+

LAG()

The opposite of LEAD(), it lets you look behind by a specified number of rows.

Use Case: Comparing the popularity of a page with the previous page.

SELECT page_name,
       popularity,
       LAG(popularity) OVER (ORDER BY popularity DESC) AS previous_popular_page_visits
FROM freshers_in_most_visited_pages;

Output:

+-----------+-----------+-----------------------------+
| page_name | popularity| previous_popular_page_visits|
+-----------+-----------+-----------------------------+
| Home      | 100       | NULL                       |
| Blog      | 85        | 100                        |
| About     | 75        | 85                         |
| Contact   | 60        | 75                         |
| Services  | 50        | 60                         |
+-----------+-----------+-----------------------------+

NTILE()

Divides the result set into a specified number of roughly equal parts.

Use Case: Dividing pages into 4 quartiles based on their popularity.

SELECT page_name,
       popularity,
       NTILE(4) OVER (ORDER BY popularity DESC) AS popularity_quartile
FROM freshers_in_most_visited_pages;

Output:

+-----------+-----------+-------------------+
| page_name | popularity| popularity_quartile|
+-----------+-----------+-------------------+
| Home      | 100       | 1                 |
| Blog      | 85        | 1                 |
| About     | 75        | 2                 |
| Contact   | 60        | 3                 |
| Services  | 50        | 4                 |
+-----------+-----------+-------------------+

Since there are only 5 records, NTILE(4) does not perfectly distribute them into 4 groups, so the quartiles might be imbalanced. Adjustments can be made based on the actual number of records you have.

Benefits of using window functions in analytics

Granularity preservation: Unlike aggregate functions, window functions maintain the granularity of the data, allowing for detailed insights without losing the original dataset’s specifics.

Comparative analysis: Functions like LEAD() and LAG() provide the capability for trend analyses, where you can compare values against subsequent or previous rows.

Segmentation: With NTILE(), data can be segmented into groups or quartiles, which is especially helpful in ranking and percentile analyses.

Efficiency: Instead of using multiple subqueries or joins, window functions streamline the querying process by handling complex computations in a single sweep, thus optimizing performance.

BigQuery import urls to refer

Author: user

Leave a Reply