Converting delimiter-separated strings to array columns using PySpark

PySpark offers an efficient way to handle big data processing and manipulation tasks. This article demonstrates how to convert a delimiter-separated string column into an array column in PySpark, using the split function from the pyspark.sql.functions module. The example DataFrame and its columns are named with the prefix freshers_in. The same approach applies to similar transformations in which each value packs several items into one string and they need to be handled as separate elements.

Initializing a SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("String to array column @Freshers.in Training Example") \
    .getOrCreate()

Creating sample data

Let’s create a sample DataFrame freshers_in_df to demonstrate the conversion of delimiter-separated strings to array columns.

from pyspark.sql import Row

data = [Row(freshers_in_name='Dhoni Mahinder', freshers_in_subjects='Math,Physics,Chemistry'),
        Row(freshers_in_name='Dhoni Mahinder', freshers_in_subjects='Biology,Physics,Chemistry')]

freshers_in_df = spark.createDataFrame(data)
freshers_in_df.show(truncate=False)

Output:

+----------------+-------------------------+
|freshers_in_name|freshers_in_subjects     |
+----------------+-------------------------+
|Dhoni Mahinder  |Math,Physics,Chemistry   |
|Dhoni Mahinder  |Biology,Physics,Chemistry|
+----------------+-------------------------+

Conversion of delimiter-separated string to array column

Let’s say the freshers_in_subjects column in our DataFrame freshers_in_df contains strings with subjects separated by commas. We want to convert this column into an array column, where each element of the array is a subject.

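You can achieve this using the split function from the pyspark.sql.functions module. The split function takes the column to be split and a delimiter pattern, which Spark interprets as a regular expression. A plain comma has no special meaning in a regular expression, so "," can be passed as-is.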

from pyspark.sql.functions import split
freshers_in_df = freshers_in_df.withColumn("freshers_in_subjects", split("freshers_in_subjects", ","))
freshers_in_df.show(truncate=False)

Output:

+----------------+-----------------------------+
|freshers_in_name|freshers_in_subjects         |
+----------------+-----------------------------+
|Dhoni Mahinder  |[Math, Physics, Chemistry]   |
|Dhoni Mahinder  |[Biology, Physics, Chemistry]|
+----------------+-----------------------------+

Now, the freshers_in_subjects column has been successfully converted from a delimiter-separated string to an array column.

Author: user
