PySpark: Reference a column in a DataFrame – col

user November 20, 2023

Efficient data manipulation and transformation are central to working with large datasets in PySpark, and the col function is one of the basic tools for both. This article explains what col does, what its advantages are, and how to apply it in a worked example.

The col function in PySpark references a DataFrame column by its name and returns a Column expression. It is the basis for column operations such as selection, filtering, and transformation.

Syntax:

from pyspark.sql.functions import col
col("column_name")

Advantages of using `col`

Simplicity and Readability: A column is referenced by its name, so an expression such as col("Score") > 80 reads close to the logic it expresses.

Flexibility in Data Manipulation: The same column reference works across DataFrame operations such as sorting, grouping, and aggregating.

Ease of Column Operations: Column expressions support arithmetic, comparison, and logical operators, so complex expressions and calculations can be built from them.

Use case: Analyzing customer data

Scenario

Consider a dataset containing customer names and their respective scores in a loyalty program.

Objective

Our goal is to select the customers whose scores are above a certain threshold (80 in this example) and calculate their average score.

Sample data creation

Let’s start by creating a DataFrame with sample customer data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize Spark Session
spark = SparkSession.builder.appName("col_function_example").getOrCreate()
# Sample data
data = [("Sachin", 85),
        ("Ram", 90),
        ("Raju", 70),
        ("David", 95),
        ("Wilson", 65)]
# Define column names (Spark infers the column types from the data)
schema = ["Name", "Score"]
# Create DataFrame
df = spark.createDataFrame(data, schema)

Applying `col` for data analysis

We’ll use col to filter the DataFrame and then calculate the average score over the filtered rows.

# Filtering customers with scores above 80
high_scorers = df.filter(col("Score") > 80)
# Showing the filtered data
high_scorers.show()
# Calculating the average score of high scorers
avg_score = high_scorers.groupBy().avg("Score")
# Showing the average score
avg_score.show()

The high_scorers DataFrame lists the customers with scores above 80.

The avg_score DataFrame holds a single row with the average score of these high-scoring customers.

Output

+------+-----+
|  Name|Score|
+------+-----+
|Sachin|   85|
|   Ram|   90|
| David|   95|
+------+-----+

+----------+
|avg(Score)|
+----------+
|      90.0|
+----------+
