Spark : Calculate the number of unique elements in a column using PySpark

pyspark.sql.functions.countDistinct

In PySpark, the countDistinct function calculates the number of unique values in a column, also known as the distinct count. A related but different approach is DataFrame.distinct(), which returns a new DataFrame with duplicate rows removed, considering all columns. To obtain the distinct count over one or more specific columns, use the PySpark SQL function countDistinct; the result is the number of unique items in the group.

Here is an example of how to use the countDistinct function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct
spark = SparkSession.builder.appName('Freshers.in countDistinct Learning').getOrCreate()
data = [("John", "Finance"), ("Jane", "IT"), ("Jim", "Finance"), ("Wilson John", "Travel"), ("Mike", "Travel")]
columns = ["Name", "Dept"]
df = spark.createDataFrame(data=data, schema=columns)

# Using countDistinct()
df.select(countDistinct("Dept")).show()
Output
+--------------------+
|count(DISTINCT Dept)|
+--------------------+
|                   3|
+--------------------+
countDistinct returns a new Column representing the distinct count of the given column or columns.