PySpark: Reference a column in a DataFrame – col


Selecting, filtering, and transforming columns are central to working with PySpark DataFrames, and the col function is a common way to refer to a column in those operations. This article explains what col does, lists its advantages, and applies it in a worked example.

The col function in PySpark references a DataFrame column by its name and returns a Column object. That Column can then be used in column-based operations such as selection, filtering, and transformations.

Syntax:

from pyspark.sql.functions import col
# Build a Column expression from a column name
col("column_name")

Advantages of using col

Simplicity and Readability: Enhances code readability by allowing column reference using column names.

Flexibility in Data Manipulation: Facilitates various DataFrame operations like sorting, grouping, and aggregating.

Ease of Column Operations: Enables complex expressions and calculations on DataFrame columns.

Use case: Analyzing customer data

Scenario

Consider a dataset containing customer names and their respective scores in a loyalty program.

Objective

Our goal is to select the customers whose scores are above a threshold (80 in this example) and calculate their average score.

Sample data creation

Let’s start by creating a DataFrame with sample customer data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize Spark Session
spark = SparkSession.builder.appName("col_function_example").getOrCreate()
# Sample data
data = [("Sachin", 85),
        ("Ram", 90),
        ("Raju", 70),
        ("David", 95),
        ("Wilson", 65)]
# Define column names (column types are inferred from the data)
schema = ["Name", "Score"]
# Create DataFrame
df = spark.createDataFrame(data, schema)

Applying col for data analysis

We’ll use col to filter and perform calculations on the DataFrame.

# Filtering customers with scores above 80
high_scorers = df.filter(col("Score") > 80)
# Showing the filtered data
high_scorers.show()
# Calculating the average score of high scorers
avg_score = high_scorers.groupBy().avg("Score")
# Showing the average score
avg_score.show()

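The high_scorers DataFrame contains only the customers with scores above 80.

The avg_score DataFrame has a single row holding the average score of these high-scoring customers. Calling groupBy() with no arguments treats the whole DataFrame as one group.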

Output
+------+-----+
|  Name|Score|
+------+-----+
|Sachin|   85|
|   Ram|   90|
| David|   95|
+------+-----+
+----------+
|avg(Score)|
+----------+
|      90.0|
+----------+
Author: user