Calculating correlation between dataframe columns with PySpark : corr

user October 22, 2023

In data analysis, understanding how two columns relate to each other often informs the decisions that follow. Correlation is a statistical measure of the extent to which two variables move in relation to each other. In this article, we calculate the correlation of two columns in a PySpark DataFrame using the corr function, which returns the correlation coefficient as a double value.

The corr function computes the Pearson Correlation Coefficient, and because the work is distributed by Spark, it remains practical on large datasets. The example below shows how to run the calculation and how to interpret the result.

Creating a DataFrame with sample data:

Create a sample DataFrame with hardcoded values. Here, we are simulating data that could represent two related phenomena (e.g., hours studied vs. test scores).

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Start (or reuse) a Spark session for this example
spark = SparkSession.builder \
    .appName("Correlation Calculation") \
    .getOrCreate()

# Each Row is one student: hours studied and the resulting test score
data = [
    Row(hours_studied=10, test_score=75),
    Row(hours_studied=15, test_score=80),
    Row(hours_studied=20, test_score=90),
    Row(hours_studied=25, test_score=95),
    Row(hours_studied=30, test_score=97)
]

# Column names and types are inferred from the Row fields
df = spark.createDataFrame(data)
df.show()

Output

+-------------+----------+
|hours_studied|test_score|
+-------------+----------+
|           10|        75|
|           15|        80|
|           20|        90|
|           25|        95|
|           30|        97|
+-------------+----------+

Calculating Correlation with the corr function:

PySpark SQL provides the corr function to calculate the Pearson Correlation Coefficient between two columns. Use the select method to apply the corr function:

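from pyspark.sql.functions import corr

# corr is an aggregate function, so the select returns a single row;
# collect() brings that row to the driver and we read the aliased column
correlation = (
    df.select(corr("hours_studied", "test_score").alias("correlation"))
    .collect()[0]["correlation"]
)
print(f"Pearson Correlation Coefficient: {correlation}")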

This will calculate and print the Pearson Correlation Coefficient, which is a value between -1 and 1. A value close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative linear relationship, and a value near 0 indicates little or no linear relationship.

Output

Pearson Correlation Coefficient: 0.9763075036742054
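As a sanity check, the same coefficient can be reproduced without Spark. The sketch below is not part of the original example: the helper name pearson is hypothetical, and it applies the textbook definition (the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations) to the five sample rows.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from each mean
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Same values as the sample DataFrame above
hours_studied = [10, 15, 20, 25, 30]
test_score = [75, 80, 90, 95, 97]

r = pearson(hours_studied, test_score)
print(f"Pearson r (plain Python): {r:.4f}")  # prints 0.9763
```

For this data the sum of deviation products is 295 and the two sums of squared deviations are 250 and 365.2, so r = 295 / sqrt(250 * 365.2), which is approximately 0.9763 and agrees with the value Spark reports. Reversing the order of one list flips the sign of the result, which is what a strong negative correlation looks like.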


Author: user
