PySpark : How to compute covariance using covar_pop and covar_samp

user September 27, 2023

Covariance is a statistical measure that indicates the extent to which two variables change together. If the variables increase and decrease simultaneously, the covariance is positive. If one variable increases when the other decreases, the covariance is negative.

covar_pop Vs covar_samp

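covar_pop: Calculates the population covariance between two columns. It is calculated as:
$\operatorname{cov}_{pop}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
covar_samp: Calculates the sample covariance between two columns. It divides by N - 1 instead of N and is calculated as:
$\operatorname{cov}_{samp}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
Here N is the number of rows, and $\bar{x}$ and $\bar{y}$ are the means of the two columns.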
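As a quick check of the two definitions, the following plain-Python sketch computes both covariances by hand. The only difference between them is the divisor: N for the population version and N - 1 for the sample version. The helper name covariance and the two lists are illustrative; the values are taken from the sample data used later in this article.

```python
# Values from the article's sample data:
# session_duration (minutes) and pages_visited for five users.
x = [10.0, 15.0, 20.0, 25.0, 30.0]
y = [5, 7, 9, 10, 12]


def covariance(xs, ys, sample=False):
    """Return the covariance of xs and ys.

    Divides by N for the population covariance (default)
    and by N - 1 when sample=True.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (n - 1 if sample else n)


cov_pop = covariance(x, y)                # divisor N
cov_samp = covariance(x, y, sample=True)  # divisor N - 1

print("Population covariance:", round(cov_pop, 6))
print("Sample covariance:", round(cov_samp, 6))
```

For five rows the sum of deviation products is 85, so the population value is 85 / 5 and the sample value is 85 / 4.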

Example using Website Analytics Data

Assume we have website analytics data for a hypothetical website freshers.in, with the following schema:

user_id: Identifier for the user.

session_duration: The time (in minutes) the user spent on the website.

pages_visited: The number of pages visited by the user during the session.

Sample Data:

+-------+----------------+-------------+
|user_id|session_duration|pages_visited|
+-------+----------------+-------------+
|      1|            10.0|            5|
|      2|            15.0|            7|
|      3|            20.0|            9|
|      4|            25.0|           10|
|      5|            30.0|           12|
+-------+----------------+-------------+

Python PySpark Script

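To compute the population and sample covariance on the above data, use the following PySpark script. It obtains the sample covariance from df.stat.cov and derives the population covariance from it: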

from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize a SparkSession
spark = SparkSession.builder.appName("Covariance Example").getOrCreate()
# Sample Data
data = [
    Row(user_id=1, session_duration=10.0, pages_visited=5),
    Row(user_id=2, session_duration=15.0, pages_visited=7),
    Row(user_id=3, session_duration=20.0, pages_visited=9),
    Row(user_id=4, session_duration=25.0, pages_visited=10),
    Row(user_id=5, session_duration=30.0, pages_visited=12)
]
# Define Schema and Create DataFrame
schema = ["user_id", "session_duration", "pages_visited"]
df = spark.createDataFrame(data, schema=schema)
# Compute Sample Covariance
covar_samp = df.stat.cov('session_duration', 'pages_visited')
# Compute Population Covariance
n = df.count()  # number of rows in DataFrame
covar_pop = (n - 1) / n * covar_samp  # adjust sample covariance to find population covariance
# Show the results
print("Population Covariance: ", covar_pop)
print("Sample Covariance: ", covar_samp)
# Stop the SparkSession
spark.stop()

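Output

Executing the above script prints the covariance between session_duration and pages_visited:

Population Covariance:  17.000000000000004
Sample Covariance:  21.250000000000004

Both values are positive, which means longer sessions tend to go with more pages visited. The results are deterministic for this data; only the trailing floating-point digits may differ slightly between environments.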

Explanation:
covar_samp: It is calculated directly using the df.stat.cov('session_duration', 'pages_visited') method, which returns the sample covariance.
covar_pop: It is derived from the sample covariance. To convert the sample covariance to the population covariance, use the formula:

$\text{covar\_pop} = \frac{n-1}{n} \times \text{covar\_samp}$

where $n$ is the number of data points (rows in the DataFrame). With $n = 5$ here, 21.25 × 4/5 = 17.0.


Author: user
