Computing the average value of a numeric column in PySpark

The mean function in PySpark computes the average value of a numeric column. It is one of the aggregate functions in pyspark.sql.functions (an alias of avg) and gives a quick measure of the central tendency of numerical data. This article covers the benefits of mean and walks through a practical example.

The function is imported from pyspark.sql.functions and called with a column name or a Column object:

from pyspark.sql.functions import mean
mean("column_name")

Advantages of using mean

  • Statistical Insights: Provides a quick overview of the central tendency of numeric data.
  • Data Reduction: Summarizes large datasets into a single representative value.
  • Versatility: Can be used in various contexts, from financial analysis to scientific research.

Example: Analyzing employee salaries

Consider a dataset with the names of employees and their salaries. Our goal is to calculate the average salary.

Dataset

Name    Salary
Sachin  70000
Ram     48000
Raju    54000
David   62000
Wilson  58000

Objective

Compute the average salary of the employees.
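Before turning to Spark, it can help to know what value to expect. The following is a minimal plain-Python sketch of the arithmetic behind the objective (sum of the salaries divided by their count), using the five salaries from the dataset above. The variable names are illustrative only and are not part of any PySpark API.

```python
# Salaries from the dataset above
salaries = [70000, 48000, 54000, 62000, 58000]

total = sum(salaries)        # sum of all salaries
count = len(salaries)        # number of employees
manual_mean = total / count  # arithmetic mean

print("Total:", total)
print("Count:", count)
print("Average Salary:", manual_mean)
```

The PySpark result later in the article should match the value printed here.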

Implementation in PySpark

Setting up the PySpark environment and creating the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
# Initialize Spark Session
spark = SparkSession.builder.appName("Mean Example").getOrCreate()
# Sample Data
data = [("Sachin", 70000), ("Ram", 48000), ("Raju", 54000), ("David", 62000), ("Wilson", 58000)]
# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "Salary"])
df.show()

Output

+------+------+
|  Name|Salary|
+------+------+
|Sachin| 70000|
|   Ram| 48000|
|  Raju| 54000|
| David| 62000|
|Wilson| 58000|
+------+------+

Applying the mean function:

# Calculating Mean Salary
mean_salary = df.select(mean("Salary")).collect()[0][0]
print("Average Salary:", mean_salary)

Output

Average Salary: 58400.0
