Extracting specific parts of a string that match a given regular expression pattern using PySpark


The regexp_extract function in PySpark extracts the part of a string that matches a given regular expression pattern, or a single capture group within that match. It is useful whenever a text column needs to be parsed or split into smaller components. This article explains how to use it, with a practical example.

Syntax:

regexp_extract(str, pattern, idx)

str: The string column to be searched.
pattern: The regular expression pattern defining the part of the string to extract.
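idx: The index of the capture group to extract. Group numbering starts at 1; an idx of 0 returns the entire match. If the pattern does not match, or the requested group did not take part in the match, the result is an empty string.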

Example: Extracting Information from Names

Let’s consider an example where we use regexp_extract to extract specific parts from a list of names.

Dataset Example:

Full_Name
Sachin Tendulkar
Ram Nath Kovind
Raju Srivastava
David Beckham
Wilson Raynor

Suppose we want to extract the last names from these full names.

Step-by-Step Implementation:

Initializing the PySpark Environment: Start by setting up your PySpark session and importing the necessary function.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract
spark = SparkSession.builder.appName("regexp_extract_example").getOrCreate()

Creating the DataFrame:

Create a DataFrame with the given names.

data = [("Sachin Tendulkar",), ("Ram Nath Kovind",), ("Raju Srivastava",), ("David Beckham",), ("Wilson Raynor",)]
df = spark.createDataFrame(data, ["Full_Name"])
df.show()

Applying regexp_extract:

We will use a regular expression to extract the last name from each full name.

extracted_df = df.withColumn("Last_Name", regexp_extract("Full_Name", r"(\w+)$", 1))
extracted_df.show()

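The regular expression (\w+)$ captures the last word in the string: \w+ matches one or more word characters (letters, digits, or underscore), and $ anchors the match to the end of the string. In our case that last word is the last name, and passing 1 as idx returns the contents of the single capture group.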

Output:

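+----------------+----------+
|       Full_Name| Last_Name|
+----------------+----------+
|Sachin Tendulkar| Tendulkar|
| Ram Nath Kovind|    Kovind|
| Raju Srivastava|Srivastava|
|   David Beckham|   Beckham|
|   Wilson Raynor|    Raynor|
+----------------+----------+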

The regexp_extract function in PySpark is a concise way to pull specific patterns out of string columns. Because it is a built-in column function, it can be used directly in withColumn or select without writing a Python UDF, which makes it a practical tool for parsing text data.

Author: user