PySpark : How to compute covariance using covar_pop and covar_samp


Covariance is a statistical measure that indicates the extent to which two variables change together. If the two variables tend to increase and decrease together, the covariance is positive; if one variable increases while the other decreases, the covariance is negative.

covar_pop vs covar_samp

  • covar_pop: Calculates the population covariance between two columns:
    cov(X, Y) = (1/N) Σ (xᵢ − x̄)(yᵢ − ȳ)
  • covar_samp: Calculates the sample covariance between two columns:
    cov(X, Y) = (1/(N − 1)) Σ (xᵢ − x̄)(yᵢ − ȳ)

Here N is the number of rows and x̄, ȳ are the column means. (Both formulas are worked out by hand on the sample data below.)

Example using Website Analytics Data

Assume we have website analytics data for a hypothetical website freshers.in, with the following schema:

user_id: Identifier for the user.

session_duration: The time (in minutes) the user spent on the website.

pages_visited: The number of pages visited by the user during the session.

Sample Data:

+-------+----------------+-------------+
|user_id|session_duration|pages_visited|
+-------+----------------+-------------+
|      1|            10.0|            5|
|      2|            15.0|            7|
|      3|            20.0|            9|
|      4|            25.0|           10|
|      5|            30.0|           12|
+-------+----------------+-------------+
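
Before bringing Spark into it, both formulas can be verified by hand. The following quick sanity check in plain Python (variable names are illustrative, not part of the script below) reproduces the two covariances for this data:

durations = [10.0, 15.0, 20.0, 25.0, 30.0]
pages = [5, 7, 9, 10, 12]
n = len(durations)
mean_d = sum(durations) / n  # 20.0
mean_p = sum(pages) / n      # 8.6
# Sum of products of deviations: Σ (x_i − x̄)(y_i − ȳ) = 85.0
s = sum((d - mean_d) * (p - mean_p) for d, p in zip(durations, pages))
print("Population covariance:", s / n)        # 85 / 5 = 17.0
print("Sample covariance:", s / (n - 1))      # 85 / 4 = 21.25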

Python PySpark Script

To perform covar_pop and covar_samp on the above data, use the following PySpark script:

from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize a SparkSession
spark = SparkSession.builder.appName("Covariance Example").getOrCreate()
# Sample Data
data = [
    Row(user_id=1, session_duration=10.0, pages_visited=5),
    Row(user_id=2, session_duration=15.0, pages_visited=7),
    Row(user_id=3, session_duration=20.0, pages_visited=9),
    Row(user_id=4, session_duration=25.0, pages_visited=10),
    Row(user_id=5, session_duration=30.0, pages_visited=12)
]
# Define Schema and Create DataFrame
schema = ["user_id", "session_duration", "pages_visited"]
df = spark.createDataFrame(data, schema=schema)
# Compute Sample Covariance
covar_samp = df.stat.cov('session_duration', 'pages_visited')
# Compute Population Covariance
n = df.count()  # number of rows in DataFrame
covar_pop = (n - 1) / n * covar_samp  # adjust sample covariance to find population covariance
# Show the results
print("Population Covariance: ", covar_pop)
print("Sample Covariance: ", covar_samp)
# Stop the SparkSession
spark.stop()

Output

Population Covariance:  17.000000000000004
Sample Covariance:  21.250000000000004

Executing the above script gives the covariance between session_duration and pages_visited. For this fixed sample data the results are deterministic; the tiny trailing digits are floating-point artifacts.

Explanation:
covar_samp: Calculated directly with the df.stat.cov('session_duration', 'pages_visited') method, which returns the sample covariance.
covar_pop: Derived from the sample covariance. To convert the sample covariance into the population covariance, use the formula:

covar_pop = [(n - 1) / n] × covar_samp

where n is the number of data points (rows in the DataFrame). Here, covar_pop = (4/5) × 21.25 = 17.0, matching the output above.
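
Spark SQL also exposes covar_pop and covar_samp directly as aggregate functions, so the manual adjustment above is optional. Below is a minimal sketch using expr against the df built earlier (run it before spark.stop()); note that recent PySpark releases also make covar_pop and covar_samp importable from pyspark.sql.functions:

from pyspark.sql.functions import expr
# Compute both covariances in one aggregation using the built-in SQL functions
df.agg(
    expr("covar_pop(session_duration, pages_visited)").alias("covar_pop"),
    expr("covar_samp(session_duration, pages_visited)").alias("covar_samp")
).show()

This returns the same 17.0 and 21.25 computed above, with no need to count rows or rescale.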
