Data parsing in BigQuery: REGEXP_EXTRACT() – Capture specific patterns within a text field

Google Big Query @

This article provides an in-depth understanding of REGEXP_EXTRACT(), complete with examples you can run directly in BigQuery.

Understanding REGEXP_EXTRACT():

The REGEXP_EXTRACT() function in BigQuery uses regular expressions to capture specific patterns within a text field. The function returns the part of the string matching the regular expression pattern you specify. Its basic syntax is:

REGEXP_EXTRACT(expression, regex_pattern)

Here, ‘expression‘ is the string from which you want to extract data, and ‘regex_pattern’ is the regular expression defining the pattern of the data to be extracted.

Let’s dissect this with practical examples. For the following examples, we’ll use a mock dataset representing web traffic data.

Sample data: First, let’s create a sample dataset and table in BigQuery. You can run the following commands in the BigQuery console:

-- Create a dataset

-- Create a table
CREATE TABLE IF NOT EXISTS web_data.WebTraffic (
  id INT64,
  url STRING,
  email STRING

-- Insert sample data
INSERT INTO web_data.WebTraffic (id, url, email) VALUES
  (1, '', ''),
  (2, '', ''),
  (3, '', '');

Example 1:

Extracting domain names:

Suppose we want to extract the domain names from the URLs in our data. We’ll use REGEXP_EXTRACT() to specify a pattern that captures the domain name from each URL.

  REGEXP_EXTRACT(url, r'https://www.([a-zA-Z0-9-]+).com') as domain

This query will return the ‘id’ and ‘url’, along with the domain names extracted from the URLs.

Example 2:

Extracting email usernames:

Now, let’s say we want to extract the username from each email address (the part before “@”). Again, we’ll use REGEXP_EXTRACT() to isolate this portion of the email string.

  REGEXP_EXTRACT(email, r'([a-zA-Z0-9._%-]+)@') as username

This query extracts and returns the username part of the email addresses, giving you rows of ‘id’, ’email’, and ‘username’.

Understanding the REGEX Pattern: Regular expressions (REGEX) are sequences of characters that define a search pattern. In the examples above, we used ‘r’ before the pattern string, signifying a raw string literal, making backslashes () be treated as literal characters. It’s a common practice when dealing with regex patterns in BigQuery to avoid special character clashes.

The patterns themselves ('https://www.([a-zA-Z0-9-]+).com' and '([a-zA-Z0-9._%-]+)@') use a combination of literal characters and character sets (like [a-zA-Z0-9-]) to match specific portions of the input strings. The parentheses () are capturing groups used to delineate the exact substring you want to extract.

id url domain
1 example
2 another-example
3 yetanotherexample
id email username
1 sachin.s
2 rajesh.nair
3 arun.prakash
BigQuery import urls to refer
Author: user