Identifying the maximum value among columns with PySpark’s greatest function

PySpark @ Freshers.in

When working with data in PySpark, it’s often useful to compare values across columns to find the highest value in each row. PySpark’s greatest function serves this purpose: it takes two or more columns and returns the largest value for each row, skipping null values (the result is null only when every input is null). In this tutorial, we’ll demonstrate how to use the greatest function with a hands-on example.
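To make the null-skipping rule concrete before touching Spark, here is a minimal plain-Python sketch of the same row-wise logic. The helper name row_greatest is hypothetical and is not part of PySpark; it only models the behaviour for a single row and says nothing about how Spark implements it internally.

```python
from typing import Optional


def row_greatest(*values: Optional[int]) -> Optional[int]:
    """Model of a row-wise greatest: skip None, return None if nothing is left."""
    if len(values) < 2:
        # PySpark's greatest likewise requires at least two columns.
        raise ValueError("row_greatest needs at least two values")
    present = [v for v in values if v is not None]
    return max(present) if present else None


print(row_greatest(85, 90, 88))         # 90
print(row_greatest(92, None, 89))       # 92, the None is skipped
print(row_greatest(None, None, None))   # None, nothing to compare
```

The last call shows the one case where a null comes back: every input for that row is missing.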

For the purpose of this tutorial, let’s set up a sample DataFrame containing grades of students in different subjects:

from pyspark.sql import Row, SparkSession

# Start (or reuse) a Spark session for the demonstration
spark = SparkSession.builder \
    .appName("Greatest Function Demonstration") \
    .getOrCreate()

# Sample grades; None marks a missing score
data = [
    Row(name="Sachin", math=85, physics=90, chemistry=88),
    Row(name="Manu", math=92, physics=None, chemistry=89),
    Row(name="Bobby", math=88, physics=86, chemistry=92),
    Row(name="Kabir", math=None, physics=90, chemistry=90)
]
df = spark.createDataFrame(data)
df.show()

Output

+------+----+-------+---------+
|  name|math|physics|chemistry|
+------+----+-------+---------+
|Sachin|  85|     90|       88|
|  Manu|  92|   NULL|       89|
| Bobby|  88|     86|       92|
| Kabir|NULL|     90|       90|
+------+----+-------+---------+

Use the greatest function to identify the highest score for each student

Now, let’s apply the greatest function to determine the highest grade each student received across the three subjects:

from pyspark.sql.functions import greatest
df_with_greatest = df.select("name", greatest("math", "physics", "chemistry").alias("highest_grade"))
df_with_greatest.show()

Output

+------+-------------+
|  name|highest_grade|
+------+-------------+
|Sachin|           90|
|  Manu|           92|
| Bobby|           92|
| Kabir|           90|
+------+-------------+

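The greatest function inspects the “math”, “physics”, and “chemistry” columns and returns the largest non-null value for each row in a new column named “highest_grade”. For Manu and Kabir, the null score is skipped, so the result comes from the two remaining subjects.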

Author: user