# Computing the average value of a numeric column in PySpark

The mean function in PySpark computes the average value of a numeric column. It is one of PySpark’s aggregate functions, which reduce many rows to a single summary value and are a staple of statistical analysis. This article covers what the mean function does, its benefits, and how to apply it in a worked example.

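The mean function is imported from pyspark.sql.functions and takes the column to average, given either as a column name or as a Column object:

from pyspark.sql.functions import mean

Benefits of the mean function:
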
• Statistical Insights: Provides a quick overview of the central tendency of numeric data.
• Data Reduction: Summarizes large datasets into a single representative value.
• Versatility: Can be used in various contexts, from financial analysis to scientific research.

### Example: Analyzing employee salaries

Consider a dataset with the names of employees and their salaries. Our goal is to calculate the average salary.

| Name   | Salary |
|--------|--------|
| Sachin | 70000  |
| Ram    | 48000  |
| Raju   | 54000  |
| David  | 62000  |
| Wilson | 58000  |

### Objective

Compute the average salary of the employees.
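
Because the dataset is small, the expected result can be worked out ahead of time. The following plain-Python sketch (no Spark involved; the variable names are illustrative) computes the arithmetic mean of the five salaries, which gives a value to compare the PySpark output against.

```python
# The five salaries from the table above
salaries = [70000, 48000, 54000, 62000, 58000]

# Arithmetic mean: sum of the values divided by their count (292000 / 5)
expected_mean = sum(salaries) / len(salaries)

print(expected_mean)  # 58400.0
```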

### Implementation in PySpark

Setting up the PySpark environment and creating the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
# Initialize Spark Session
spark = SparkSession.builder.appName("Mean Example").getOrCreate()
# Sample Data
data = [("Sachin", 70000), ("Ram", 48000), ("Raju", 54000), ("David", 62000), ("Wilson", 58000)]
# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "Salary"])
df.show()

Output

+------+------+
|  Name|Salary|
+------+------+
|Sachin| 70000|
|   Ram| 48000|
|  Raju| 54000|
| David| 62000|
|Wilson| 58000|
+------+------+

Applying the mean function:

# Calculating Mean Salary
mean_salary = df.select(mean("Salary")).collect()[0][0]
print("Average Salary:", mean_salary)

Output

Average Salary: 58400.0

Author: user