Covariance is a statistical measure that indicates the extent to which two variables change together. If the variables increase and decrease simultaneously, the covariance is positive. If one variable increases when the other decreases, the covariance is negative.
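covar_pop vs covar_samp
- covar_pop: Calculates the population covariance between two columns. The sum of products of deviations is divided by the number of observations N:
$$\operatorname{cov}_{\text{pop}}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$
- covar_samp: Calculates the sample covariance between two columns. The same sum is divided by N - 1 instead of N:
$$\operatorname{cov}_{\text{samp}}(X, Y) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$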
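To make the difference between the two divisors concrete, here is a minimal plain-Python sketch of both formulas. The helper names (mean, co_deviation_sum, population_covariance, sample_covariance) and the small xs/ys lists are illustrative assumptions, not part of PySpark or of the example data used later.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def co_deviation_sum(xs, ys):
    """Sum of (x_i - mean(x)) * (y_i - mean(y)) over paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


def population_covariance(xs, ys):
    """Population covariance: divide by N."""
    return co_deviation_sum(xs, ys) / len(xs)


def sample_covariance(xs, ys):
    """Sample covariance: divide by N - 1 (needs at least two observations)."""
    if len(xs) < 2:
        raise ValueError("sample covariance needs at least two observations")
    return co_deviation_sum(xs, ys) / (len(xs) - 1)


# Hypothetical toy data, chosen only to keep the arithmetic easy to follow.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

pop = population_covariance(xs, ys)
samp = sample_covariance(xs, ys)
print("Population covariance:", pop)
print("Sample covariance:", samp)
```

Because both functions share the same numerator, the sample value is always the population value scaled by N / (N - 1), so the gap between them shrinks as the number of rows grows.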
Example using Website Analytics Data
Assume we have website analytics data for a hypothetical website freshers.in, with the following schema:
- user_id: Identifier for the user.
- session_duration: The time (in minutes) the user spent on the website.
- pages_visited: The number of pages the user visited during the session.
Sample Data:
+-------+----------------+-------------+
|user_id|session_duration|pages_visited|
+-------+----------------+-------------+
|      1|            10.0|            5|
|      2|            15.0|            7|
|      3|            20.0|            9|
|      4|            25.0|           10|
|      5|            30.0|           12|
+-------+----------------+-------------+
Python PySpark Script
To perform covar_pop and covar_samp on the above data, use the following PySpark script:
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize a SparkSession
spark = SparkSession.builder.appName("Covariance Example").getOrCreate()

# Sample Data
data = [
    Row(user_id=1, session_duration=10.0, pages_visited=5),
    Row(user_id=2, session_duration=15.0, pages_visited=7),
    Row(user_id=3, session_duration=20.0, pages_visited=9),
    Row(user_id=4, session_duration=25.0, pages_visited=10),
    Row(user_id=5, session_duration=30.0, pages_visited=12),
]

# Define Schema and Create DataFrame
schema = ["user_id", "session_duration", "pages_visited"]
df = spark.createDataFrame(data, schema=schema)

# Compute Sample Covariance (df.stat.cov returns the sample covariance)
covar_samp = df.stat.cov('session_duration', 'pages_visited')

# Compute Population Covariance
n = df.count()  # number of rows in DataFrame
covar_pop = (n - 1) / n * covar_samp  # rescale sample covariance to population covariance

# Show the results
print("Population Covariance: ", covar_pop)
print("Sample Covariance: ", covar_samp)

# Stop the SparkSession
spark.stop()
Output
Executing the above script prints the covariance between session_duration and pages_visited. For the five sample rows the values are determined by the data, although the trailing digits may differ slightly because of floating-point rounding:
Population Covariance: 17.000000000000004
Sample Covariance: 21.250000000000004
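As a check on the two values in the output, the arithmetic for the five sample rows can be worked through by hand. Here x stands for session_duration and y for pages_visited; the notation is only a convenience for this derivation.

```latex
\begin{aligned}
\bar{x} &= \tfrac{1}{5}(10 + 15 + 20 + 25 + 30) = 20, \qquad
\bar{y} = \tfrac{1}{5}(5 + 7 + 9 + 10 + 12) = 8.6 \\
\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})
  &= (-10)(-3.6) + (-5)(-1.6) + (0)(0.4) + (5)(1.4) + (10)(3.4) \\
  &= 36 + 8 + 0 + 7 + 34 = 85 \\
\operatorname{cov}_{\text{pop}}(X, Y) &= \tfrac{85}{5} = 17, \qquad
\operatorname{cov}_{\text{samp}}(X, Y) = \tfrac{85}{4} = 21.25
\end{aligned}
```

The extra digits in the printed values (for example the trailing 4 in 17.000000000000004) come from floating-point rounding, not from the data.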
Explanation:
- covar_samp: Calculated directly with the df.stat.cov('session_duration', 'pages_visited') method, which returns the sample covariance.
- covar_pop: Derived from the sample covariance. To convert the sample covariance to the population covariance, multiply it by (n - 1) / n:
$$\text{covar\_pop} = \frac{n - 1}{n} \cdot \text{covar\_samp}$$
where n is the number of data points (rows in the DataFrame).