Calculating the average of a set of numerical values in PySpark – avg – Examples included


PySpark’s avg function is designed for one of the most common data analysis tasks – calculating the average of a set of numerical values. Whether you’re dealing with financial data, sensor readings, or user ratings, avg simplifies the process of computing the mean value efficiently. We’ll explore real-world examples, the advantages of using avg, and the diverse scenarios where it can enhance your data analysis.

The basic syntax of the avg function is:

from pyspark.sql.functions import avg
avg_col = avg("column_name")  # returns a Column expression to use in select() or agg()
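
In practice, avg returns a Column expression, so it is passed to select, agg, or a grouped aggregation rather than called on its own. A minimal sketch, assuming a hypothetical DataFrame with a "category" column and a numeric "value" column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("avg syntax sketch").getOrCreate()
df = spark.createDataFrame([("A", 10.0), ("A", 20.0), ("B", 30.0)], ["category", "value"])
# Overall average of the column
df.select(avg("value").alias("avg_value")).show()
# Average per group
df.groupBy("category").agg(avg("value").alias("avg_value")).show()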

Advantages of using PySpark’s avg

1. Scalability

PySpark is renowned for its scalability, enabling you to analyze large datasets effortlessly. The avg function takes full advantage of Spark’s distributed computing capabilities, making it suitable for processing massive amounts of data efficiently.

2. Speed

With the ability to parallelize computation across multiple nodes in a cluster, PySpark’s avg function can significantly reduce processing time. This speed is critical for time-sensitive data analysis tasks or real-time data streaming applications.

3. Accuracy

avg aggregates numerical data accurately because it handles missing values gracefully: null entries are excluded from both the sum and the count, so the mean reflects only the non-null values and is not skewed by incomplete or inconsistent data.
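
To illustrate, a small sketch with hypothetical readings, where one of three entries is null:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("avg null handling sketch").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, None), (3, 20.0)], "id INT, reading DOUBLE")
# The null reading is ignored, so the mean is (10.0 + 20.0) / 2 = 15.0, not 10.0
df.select(avg("reading")).show()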

Let’s dive into some real-world scenarios where PySpark’s avg function shines.

Example 1: Financial data analysis

Suppose you have a dataset containing daily stock prices, and you want to calculate the average closing price over a specific time period.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("avg example 1 @ Freshers.in").getOrCreate()
# Sample DataFrame with stock prices
data = [(1, "2023-01-01", 100.0),
        (2, "2023-01-02", 102.5),
        (3, "2023-01-03", 98.0)]
df = spark.createDataFrame(data, ["day", "date", "closing_price"])
# Calculate average closing price
avg_price = df.select(avg("closing_price")).collect()[0][0]
print(f"Average Closing Price: {avg_price}")

Output

Average Closing Price: 100.16666666666667
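
If the dataset spans a longer period, the same pattern extends naturally: filter to the window of interest before aggregating. A hedged sketch that reuses df from this example and assumes the date column is an ISO-formatted string:

from pyspark.sql.functions import avg, col
# Average over the first two days only (101.25 for the sample data above)
windowed_avg = (df.filter(col("date").between("2023-01-01", "2023-01-02"))
                  .select(avg("closing_price"))
                  .collect()[0][0])
print(f"Average Closing Price (Jan 1-2): {windowed_avg}")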

Example 2: User ratings

Imagine you have a dataset of user ratings for a product, and you want to determine the average user satisfaction score.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("avg 2 @ Freshers.in").getOrCreate()
# Sample DataFrame with user ratings
data = [("User1", 4.0),
        ("User2", 4.5),
        ("User3", 5.0)]
df = spark.createDataFrame(data, ["user", "rating"])
# Calculate average user satisfaction score
avg_rating = df.select(avg("rating")).collect()[0][0]
print(f"Average User Rating: {avg_rating}")

Output

Average User Rating: 4.5
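
Real rating data usually covers more than one product, in which case groupBy pairs naturally with avg to produce a per-product satisfaction score. A minimal sketch with hypothetical product names, also counting how many ratings each average is based on:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count
spark = SparkSession.builder.appName("avg per product sketch").getOrCreate()
data = [("ProductA", 4.0), ("ProductA", 5.0), ("ProductB", 3.5)]
df_products = spark.createDataFrame(data, ["product", "rating"])
# Average rating and number of ratings per product
df_products.groupBy("product").agg(avg("rating").alias("avg_rating"),
                                   count("rating").alias("num_ratings")).show()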

Scenarios / Use cases

  1. Financial Analysis: Calculate average prices, returns, or trading volumes for stocks, commodities, or currencies over various time intervals.
  2. User Engagement: Analyze user interactions, such as click-through rates, session durations, or purchase amounts, to understand and improve user engagement.
  3. Quality Assurance: Assess the quality of products or services by computing average customer ratings or feedback scores.
  4. Sensor Data Analysis: Process sensor data from IoT devices to calculate average values, such as temperature, humidity, or pressure, for monitoring and control purposes.
  5. Market Basket Analysis: In retail analytics, calculate the average number of items in a shopping cart to identify buying patterns and optimize product placement.
