PySpark : Correlation Analysis in PySpark with a detailed example

user April 11, 2023 Leave a Comment

In this article, we will explore correlation analysis in PySpark, a statistical technique used to measure the strength and direction of the relationship between two continuous variables. We will provide a detailed example using hardcoded values as input.

Prerequisites

Python 3.7 or higher
PySpark library
Java 8 or higher

Creating a PySpark DataFrame with Hardcoded Values

First, let’s create a PySpark DataFrame with hardcoded values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
    .appName("Correlation Analysis Example") \
    .getOrCreate()

data_schema = StructType([
    StructField("name", StringType(), True),
    StructField("variable1", DoubleType(), True),
    StructField("variable2", DoubleType(), True),
])

data = spark.createDataFrame([
    ("A", 1.0, 2.0),
    ("B", 2.0, 3.0),
    ("C", 3.0, 4.0),
    ("D", 4.0, 5.0),
    ("E", 5.0, 6.0),
], data_schema)

data.show()

Output

+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
|   A|      1.0|      2.0|
|   B|      2.0|      3.0|
|   C|      3.0|      4.0|
|   D|      4.0|      5.0|
|   E|      5.0|      6.0|
+----+---------+---------+

Calculating Correlation

Now, let’s calculate the correlation between variable1 and variable2:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=["variable1", "variable2"], outputCol="features")
data_vector = vector_assembler.transform(data).select("features")

correlation_matrix = Correlation.corr(data_vector, "features").collect()[0][0]
correlation_value = correlation_matrix[0, 1]
print(f"Correlation between variable1 and variable2: {correlation_value:.2f}")

Output

Correlation between variable1 and variable2: 1.00

In this example, we used the VectorAssembler to combine the two variables into a single feature vector column called features. Then, we used the Correlation module from pyspark.ml.stat to calculate the correlation between the two variables. The corr function returns a correlation matrix, from which we can extract the correlation value between variable1 and variable2.

Interpreting the Results

The correlation value ranges from -1 to 1, where:

-1 indicates a strong negative relationship
0 indicates no relationship
1 indicates a strong positive relationship

In our example, the correlation value is 1.0, which indicates a strong positive relationship between variable1 and variable2. This means that as variable1 increases, variable2 also increases, and vice versa.

In this article, we explored correlation analysis in PySpark, a statistical technique used to measure the strength and direction of the relationship between two continuous variables. We provided a detailed example using hardcoded values as input, showcasing how to create a DataFrame, calculate the correlation between two variables, and interpret the results. Correlation analysis can be useful in various fields, such as finance, economics, and social sciences, to understand the relationships between variables and make data-driven decisions.

Spark important urls to refer

Post Views: 88

Author: user

PySpark : Correlation Analysis in PySpark with a detailed example

Calculating Correlation

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Most Viewed Posts

Calculating Correlation

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget