In data analysis, understanding how two columns relate to each other can be pivotal for making informed decisions. Correlation is a statistical measure of the extent to which two variables move in relation to each other. In this article, we explore how to calculate the correlation of two columns in a PySpark DataFrame using the `corr` function, which returns the correlation coefficient as a double value.

The `corr` function is a handy tool for computing the Pearson correlation coefficient quickly, even on large datasets, thanks to Spark’s distributed computing capabilities. The example below shows how to implement and interpret a correlation calculation in PySpark.

**Creating a DataFrame with sample data:**

Create a sample DataFrame with hardcoded values. Here, we are simulating data that could represent two related phenomena (e.g., hours studied vs. test scores).

```
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Start (or reuse) a Spark session
spark = SparkSession.builder \
    .appName("Correlation Calculation") \
    .getOrCreate()

# Sample data: hours studied and the corresponding test score
data = [
    Row(hours_studied=10, test_score=75),
    Row(hours_studied=15, test_score=80),
    Row(hours_studied=20, test_score=90),
    Row(hours_studied=25, test_score=95),
    Row(hours_studied=30, test_score=97)
]

df = spark.createDataFrame(data)
df.show()
```

**Output**

```
+-------------+----------+
|hours_studied|test_score|
+-------------+----------+
|           10|        75|
|           15|        80|
|           20|        90|
|           25|        95|
|           30|        97|
+-------------+----------+
```

**Calculating Correlation with the corr function:**

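PySpark SQL provides the `corr` function (in `pyspark.sql.functions`) to calculate the Pearson correlation coefficient between two columns. Because `corr` is an aggregate function, apply it inside `select` and read the single resulting value from the first row: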

```
from pyspark.sql.functions import corr
correlation = df.select(corr("hours_studied", "test_score").alias("correlation")).collect()[0]["correlation"]
print(f"Pearson Correlation Coefficient: {correlation}")
```

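This calculates and prints the Pearson correlation coefficient, which is a value between -1 and 1. A value close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative linear relationship, and a value near 0 indicates little or no linear relationship. In this sample, test scores rise steadily with hours studied, so the coefficient should be close to 1.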

**Output**

`Pearson Correlation Coefficient: 0.9763075036742054`
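
As a sanity check on what `corr` computes, the sketch below works the Pearson formula in plain Python on the same five rows: the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The helper name `pearson` is illustrative only and is not part of PySpark; on a real dataset you would let Spark do this work.

```python
import math

# The same five (hours_studied, test_score) pairs used in the DataFrame above.
hours_studied = [10, 15, 20, 25, 30]
test_score = [75, 80, 90, 95, 97]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; not a PySpark function.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson(hours_studied, test_score)
print(f"{r:.4f}")
```

Rounded to four decimal places this prints `0.9763`, which agrees with the coefficient reported by `corr` above.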
