Explain dense_rank. How to use the dense_rank function in PySpark?

PySpark @ Freshers.in

In PySpark, dense_rank is a window function that assigns a rank to each row within a window partition, based on the ordering of one or more columns. Rows with equal values in the ordering columns receive the same rank, and the sequence of rank values has no gaps.

Because the ranking is dense, ties do not leave gaps in the sequence of rank values. For example, if three rows tie at rank 2, all three are assigned rank 2 and the next distinct value is assigned rank 3. By contrast, the rank function would skip ahead and assign rank 5 in the same situation. dense_rank must be applied over a window specification that defines an ordering (orderBy in the DataFrame API, ORDER BY in SQL). Adding partitionBy is optional and restarts the ranking at 1 within each partition.
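To make the difference between dense and gapped ranking concrete, here is a minimal pure-Python sketch of the two numbering schemes. It does not use Spark at all; the helper names dense_rank_values and rank_values are hypothetical and exist only for illustration, and ascending order is assumed.

```python
def dense_rank_values(values):
    """Return the dense rank of each value (ascending order, no gaps)."""
    # Each distinct value gets its 1-based position among the sorted distinct values.
    position = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [position[v] for v in values]


def rank_values(values):
    """Return the standard rank of each value (ascending order, gaps after ties)."""
    first_position = {}
    # Tied values share the 1-based position of their first occurrence in sorted order.
    for i, v in enumerate(sorted(values), start=1):
        first_position.setdefault(v, i)
    return [first_position[v] for v in values]


ages = [22, 25, 25, 25, 30]
print(dense_rank_values(ages))  # [1, 2, 2, 2, 3]
print(rank_values(ages))        # [1, 2, 2, 2, 5]
```

The three tied values share rank 2 in both schemes. Only the value after the tie differs: 3 with dense ranking, 5 with gapped ranking.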

Here is an example of how to use the dense_rank function in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

spark = SparkSession.builder.appName("dense_rank").getOrCreate()
data = [("Peter John", 25),("Wisdon Mike", 30),("Sarah Johns", 25),("Bob Beliver", 22),("Lucas Marget", 30)]

df = spark.createDataFrame(data, ["name", "age"])
# Rank rows by name within each age group
window_spec = Window.partitionBy("age").orderBy("name")
df2 = df.select("name", "age", dense_rank().over(window_spec).alias("rank"))
df2.show()
In this example, the rows are partitioned by the "age" column, and dense_rank numbers the rows within each partition in order of the "name" column, restarting at 1 for each age. The output will look like this (row order may vary, since show() does not sort the result):
+------------+---+----+
|        name|age|rank|
+------------+---+----+
| Bob Beliver| 22|   1|
|  Peter John| 25|   1|
| Sarah Johns| 25|   2|
|Lucas Marget| 30|   1|
| Wisdon Mike| 30|   2|
+------------+---+----+

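Peter John and Sarah Johns are both 25, so they fall in the same partition. Ordered by name, Peter John gets rank 1 and Sarah Johns gets rank 2. Bob Beliver is alone in his partition and gets rank 1. Because names are unique within each age group, no two rows tie in this data, so dense_rank, rank and row_number would all return the same values here.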

Author: user
