Data parsing in BigQuery: REGEXP_EXTRACT() – Capture specific patterns within a text field

Google Big Query @ Freshers.in

This article provides an in-depth understanding of REGEXP_EXTRACT(), complete with examples you can run directly in BigQuery.

Understanding REGEXP_EXTRACT():

The REGEXP_EXTRACT() function in BigQuery uses regular expressions to capture specific patterns within a text field. The function returns the part of the string matching the regular expression pattern you specify. Its basic syntax is:

REGEXP_EXTRACT(expression, regex_pattern)

Here, ‘expression‘ is the string from which you want to extract data, and ‘regex_pattern’ is the regular expression defining the pattern of the data to be extracted.

Let’s dissect this with practical examples. For the following examples, we’ll use a mock dataset representing web traffic data.

Sample data: First, let’s create a sample dataset and table in BigQuery. You can run the following commands in the BigQuery console:

-- Create a dataset
CREATE SCHEMA IF NOT EXISTS web_data;

-- Create a table
CREATE TABLE IF NOT EXISTS web_data.WebTraffic (
  id INT64,
  url STRING,
  email STRING
);

-- Insert sample data
INSERT INTO web_data.WebTraffic (id, url, email) VALUES
  (1, 'https://www.example.com/query?user=sachin', 'sachin.s@example.com'),
  (2, 'https://www.another-example.com/query?user=rajesh', 'rajesh.nair@another-example.com'),
  (3, 'https://www.yetanotherexample.com/query?user=arun', 'arun.prakash@yetanotherexample.com');

Example 1:

Extracting domain names:

Suppose we want to extract the domain names from the URLs in our data. We’ll use REGEXP_EXTRACT() to specify a pattern that captures the domain name from each URL.

SELECT 
  id,
  url,
  REGEXP_EXTRACT(url, r'https://www.([a-zA-Z0-9-]+).com') as domain
FROM 
  web_data.WebTraffic;

This query will return the ‘id’ and ‘url’, along with the domain names extracted from the URLs.

Example 2:

Extracting email usernames:

Now, let’s say we want to extract the username from each email address (the part before “@”). Again, we’ll use REGEXP_EXTRACT() to isolate this portion of the email string.

SELECT 
  id,
  email,
  REGEXP_EXTRACT(email, r'([a-zA-Z0-9._%-]+)@') as username
FROM 
  web_data.WebTraffic;

This query extracts and returns the username part of the email addresses, giving you rows of ‘id’, ’email’, and ‘username’.

Understanding the REGEX Pattern: Regular expressions (REGEX) are sequences of characters that define a search pattern. In the examples above, we used ‘r’ before the pattern string, signifying a raw string literal, making backslashes () be treated as literal characters. It’s a common practice when dealing with regex patterns in BigQuery to avoid special character clashes.

The patterns themselves ('https://www.([a-zA-Z0-9-]+).com' and '([a-zA-Z0-9._%-]+)@') use a combination of literal characters and character sets (like [a-zA-Z0-9-]) to match specific portions of the input strings. The parentheses () are capturing groups used to delineate the exact substring you want to extract.

Result 
id url domain
1 https://www.example.com/query?user=sachin example
2 https://www.another-example.com/query?user=rajesh another-example
3 https://www.yetanotherexample.com/query?user=arun yetanotherexample
id email username
1 sachin.s@example.com sachin.s
2 rajesh.nair@another-example.com rajesh.nair
3 arun.prakash@yetanotherexample.com arun.prakash
BigQuery import urls to refer
Author: user