This article provides an in-depth understanding of REGEXP_EXTRACT(), complete with examples you can run directly in BigQuery.
Understanding REGEXP_EXTRACT():
The REGEXP_EXTRACT() function in BigQuery uses regular expressions to capture specific patterns within a text field. The function returns the part of the string matching the regular expression pattern you specify. Its basic syntax is:
REGEXP_EXTRACT(expression, regex_pattern)
Here, ‘expression‘ is the string from which you want to extract data, and ‘regex_pattern’ is the regular expression defining the pattern of the data to be extracted.
Let’s dissect this with practical examples. For the following examples, we’ll use a mock dataset representing web traffic data.
Sample data: First, let’s create a sample dataset and table in BigQuery. You can run the following commands in the BigQuery console:
-- Create a dataset
CREATE SCHEMA IF NOT EXISTS web_data;
-- Create a table
CREATE TABLE IF NOT EXISTS web_data.WebTraffic (
id INT64,
url STRING,
email STRING
);
-- Insert sample data
INSERT INTO web_data.WebTraffic (id, url, email) VALUES
(1, 'https://www.example.com/query?user=sachin', 'sachin.s@example.com'),
(2, 'https://www.another-example.com/query?user=rajesh', 'rajesh.nair@another-example.com'),
(3, 'https://www.yetanotherexample.com/query?user=arun', 'arun.prakash@yetanotherexample.com');
Example 1:
Extracting domain names:
Suppose we want to extract the domain names from the URLs in our data. We’ll use REGEXP_EXTRACT() to specify a pattern that captures the domain name from each URL.
SELECT
id,
url,
REGEXP_EXTRACT(url, r'https://www.([a-zA-Z0-9-]+).com') as domain
FROM
web_data.WebTraffic;
This query will return the ‘id’ and ‘url’, along with the domain names extracted from the URLs.
Example 2:
Extracting email usernames:
Now, let’s say we want to extract the username from each email address (the part before “@”). Again, we’ll use REGEXP_EXTRACT() to isolate this portion of the email string.
SELECT
id,
email,
REGEXP_EXTRACT(email, r'([a-zA-Z0-9._%-]+)@') as username
FROM
web_data.WebTraffic;
This query extracts and returns the username part of the email addresses, giving you rows of ‘id’, ’email’, and ‘username’.
Understanding the REGEX Pattern: Regular expressions (REGEX) are sequences of characters that define a search pattern. In the examples above, we used ‘r’ before the pattern string, signifying a raw string literal, making backslashes () be treated as literal characters. It’s a common practice when dealing with regex patterns in BigQuery to avoid special character clashes.
The patterns themselves ('https://www.([a-zA-Z0-9-]+).com'
and '([a-zA-Z0-9._%-]+)@'
) use a combination of literal characters and character sets (like [a-zA-Z0-9-]
) to match specific portions of the input strings. The parentheses ()
are capturing groups used to delineate the exact substring you want to extract.
id | url | domain |
---|---|---|
1 | https://www.example.com/query?user=sachin | example |
2 | https://www.another-example.com/query?user=rajesh | another-example |
3 | https://www.yetanotherexample.com/query?user=arun | yetanotherexample |
id | username | |
---|---|---|
1 | sachin.s@example.com | sachin.s |
2 | rajesh.nair@another-example.com | rajesh.nair |
3 | arun.prakash@yetanotherexample.com | arun.prakash |