PySpark: Getting the approximate number of unique elements in a column of a DataFrame


pyspark.sql.functions.approx_count_distinct

PySpark’s approx_count_distinct function approximates the number of unique elements in a column of a DataFrame. It uses a probabilistic algorithm called HyperLogLog to estimate the count of distinct elements, which can be significantly faster and use far less memory than counting distinct elements exactly.

Here’s an example of how to use approx_count_distinct:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

# Create a SparkSession (skip this if one already exists, e.g. in pyspark shell)
spark = SparkSession.builder.appName("approx_count_distinct_example").getOrCreate()

# Create a simple DataFrame
data = [(1, "foo"), (2, "bar"), (3, "baz"), (4, "foo"), (5, "bar")]
df = spark.createDataFrame(data, ["id", "value"])

# Approximate the number of distinct elements in the "value" column
distinct_count = df.agg(approx_count_distinct("value").alias("distinct_count"))

# Show the result
distinct_count.show()

Output

+--------------+
|distinct_count|
+--------------+
|             3|
+--------------+
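To see how the approximation compares with an exact count, you can run countDistinct (the exact aggregate) alongside it in the same aggregation. A minimal sketch, reusing the df created above:

from pyspark.sql.functions import approx_count_distinct, countDistinct

# Compare the approximate and exact distinct counts side by side
df.agg(
    approx_count_distinct("value").alias("approx"),
    countDistinct("value").alias("exact")
).show()

On a DataFrame this small the two numbers will match; the difference only becomes visible on large, high-cardinality columns, where the approximate version avoids the full deduplication work that an exact count requires.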

It’s worth noting that approx_count_distinct is an approximate aggregate, so the result may not be exact. The estimation error can be controlled by passing an optional rsd argument, which stands for relative standard deviation. The default value is 0.05, meaning the maximum relative standard deviation of the estimate is 5% of the true distinct count.

For example, you can pass a smaller rsd value to get a more precise estimate:

distinct_count = df.agg(approx_count_distinct("value", rsd=0.01).alias("distinct_count"))

This tightens the maximum relative standard deviation to 1%, at the cost of more memory for the underlying sketch.
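To see the effect of rsd in practice, you can compare estimates at different precisions against an exact count on a larger column. A minimal sketch; the row count, the modulus, and the rsd values here are arbitrary choices for illustration:

from pyspark.sql.functions import approx_count_distinct, countDistinct

# Hypothetical example: 100,000 rows with exactly 20,000 distinct values
big_df = spark.range(100000).selectExpr("id % 20000 AS value")

big_df.agg(
    approx_count_distinct("value", rsd=0.10).alias("rsd_10pct"),
    approx_count_distinct("value", rsd=0.01).alias("rsd_1pct"),
    countDistinct("value").alias("exact")
).show()

With the smaller rsd, the estimate typically lands much closer to the exact 20,000, while the looser 10% setting may drift further from it.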

