## pyspark.sql.functions.approx_count_distinct

PySpark's **approx_count_distinct** function approximates the number of unique elements in a column of a DataFrame. It uses a probabilistic algorithm called HyperLogLog to estimate the count of distinct elements, which can be significantly faster and use far less memory than counting distinct elements exactly.
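To build intuition for how HyperLogLog works, here is a toy sketch in plain Python. It is only an illustration of the idea (hash each item, track the longest run of leading zero bits per register, combine registers with a harmonic mean); Spark's actual implementation is the more refined HyperLogLog++.

```python
import hashlib
import math

def hll_estimate(items, p=14):
    """Toy HyperLogLog: estimate the number of distinct items (illustrative only)."""
    m = 1 << p                      # number of registers (2**p)
    registers = [0] * m
    for item in items:
        # Take a 64-bit hash of the item
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)         # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # Harmonic mean across registers, with the standard bias-correction constant
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Linear-counting correction for small cardinalities
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)
    return int(estimate)

print(hll_estimate(range(100_000)))  # close to 100000, not exact
```

The key point is that memory is fixed at `m` small registers regardless of how many items stream through, which is why the estimate is cheap but approximate.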

Here’s an example of how to use approx_count_distinct:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.getOrCreate()

# Create a simple DataFrame
data = [(1, "foo"), (2, "bar"), (3, "baz"), (4, "foo"), (5, "bar")]
df = spark.createDataFrame(data, ["id", "value"])
# Approximate the number of distinct elements in the "value" column
distinct_count = df.agg(approx_count_distinct("value").alias("distinct_count"))
# Show the result
distinct_count.show()
```

**Output**

```
+--------------+
|distinct_count|
+--------------+
|             3|
+--------------+
```

It's worth noting that `approx_count_distinct` is an approximate function, so the result may not be exact. The error can be controlled through an optional `rsd` argument, which stands for relative standard deviation. The default value is 0.05, which means the estimate's relative standard deviation is about 5% of the true distinct count.

For example, you can pass a smaller `rsd` to get a more precise result:

```
distinct_count = df.agg(approx_count_distinct("value", rsd=0.01).alias("distinct_count"))
```

This asks for an estimate with a relative standard deviation of about 1%, at the cost of Spark keeping a larger sketch in memory.
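The precision/memory trade-off follows HyperLogLog's accuracy formula: the relative standard deviation is roughly 1.04/√m, where m is the number of registers in the sketch. The helper below is a hypothetical illustration of that formula, not a Spark API:

```python
import math

def registers_needed(rsd):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= rsd.
    Hypothetical helper illustrating the HyperLogLog accuracy formula."""
    return 2 ** math.ceil(math.log2((1.04 / rsd) ** 2))

print(registers_needed(0.05))  # 512 registers for the default 5% rsd
print(registers_needed(0.01))  # 16384 registers for 1% rsd
```

Halving the error rate roughly quadruples the register count, which is why very small `rsd` values get expensive.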
