Calculating correlation between DataFrame columns with PySpark: corr

PySpark @ Freshers.in

In data analysis, understanding how two columns of data relate to each other often informs the decisions that follow. Correlation is a statistical measure of the extent to which two variables move in relation to each other. This article shows how to calculate the correlation of two columns in a PySpark DataFrame using the corr function, which returns the Pearson Correlation Coefficient as a double value. Because the computation runs on Spark's distributed engine, the same call works on large datasets as well as on the small example used here.

Creating a DataFrame with sample data:

Create a sample DataFrame with hardcoded values. Here, we are simulating data that could represent two related phenomena (e.g., hours studied vs. test scores).

from pyspark.sql import Row, SparkSession

# Start (or reuse) a Spark session for this example
spark = SparkSession.builder \
    .appName("Correlation Calculation") \
    .getOrCreate()

# Hardcoded sample rows: hours studied and the resulting test score
data = [
    Row(hours_studied=10, test_score=75),
    Row(hours_studied=15, test_score=80),
    Row(hours_studied=20, test_score=90),
    Row(hours_studied=25, test_score=95),
    Row(hours_studied=30, test_score=97)
]

# Column names and types are inferred from the Row objects
df = spark.createDataFrame(data)
df.show()

Output

+-------------+----------+
|hours_studied|test_score|
+-------------+----------+
|           10|        75|
|           15|        80|
|           20|        90|
|           25|        95|
|           30|        97|
+-------------+----------+

Calculating Correlation with the corr function:

PySpark SQL provides the corr function to calculate the Pearson Correlation Coefficient between two columns. Use the select method to apply the corr function:

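from pyspark.sql.functions import corr

# corr is an aggregate function, so select returns a single row;
# collect() brings that row to the driver and we read the aliased column
correlation = df.select(corr("hours_studied", "test_score").alias("correlation")).collect()[0]["correlation"]
print(f"Pearson Correlation Coefficient: {correlation}")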

This calculates and prints the Pearson Correlation Coefficient, which is a value between -1 and 1. A value close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative linear relationship, and a value near 0 indicates little or no linear relationship.
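For reference, the textbook definition of the Pearson coefficient for n paired observations is given below. The symbols are notation introduced here for illustration and do not come from the PySpark API: x_i and y_i stand for the values of the two columns, and \bar{x} and \bar{y} for their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two columns vary together, and the denominator rescales it by the spread of each column, which is why the result always falls between -1 and 1.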

Output

Pearson Correlation Coefficient: 0.9763075036742054
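As a sanity check, the same coefficient can be reproduced without Spark. The following is a minimal sketch in plain Python that applies the standard Pearson formula to the five sample rows. The helper name pearson is hypothetical and is not part of PySpark.

```python
import math

# The same sample values used in the DataFrame above
hours_studied = [10, 15, 20, 25, 30]
test_score = [75, 80, 90, 95, 97]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(hours_studied, test_score)
print(f"Pearson Correlation Coefficient (plain Python): {r}")
```

The printed value should agree with the Spark result to many decimal places, though the last few digits may differ because the two implementations accumulate floating-point sums in a different order. This kind of check is only practical on small samples; for real datasets the corr function remains the appropriate tool.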

Author: user