PySpark: How to Compute the Cumulative Distribution of a Column in a DataFrame

user, February 3, 2023

pyspark.sql.functions.cume_dist

In probability and statistics, the cumulative distribution describes how the probability of a random variable, X, accumulates as its value increases. The cumulative distribution function (CDF) of X, denoted by F(x), is defined as the probability that X takes a value less than or equal to x.
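For a finite sample, the same idea becomes the empirical CDF: the fraction of observed values that are less than or equal to x. The following is a minimal plain-Python sketch of that definition, not Spark code. The helper name `empirical_cdf` and the sample ages are assumptions made for illustration.

```python
def empirical_cdf(values, x):
    """Return the fraction of observations in `values` that are <= x."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v <= x) / len(values)


# Hypothetical sample of five ages
ages = [30, 40, 25, 35, 20]

# Evaluate the empirical CDF at a few points, including values not in the sample
for x in (19, 25, 32, 40):
    print(x, empirical_cdf(ages, x))
```

Note that the empirical CDF can be evaluated at any x, while a per-row computation only reports it at the values that actually occur in the data.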

In PySpark, cume_dist is a window function that computes the cumulative distribution of values within a window partition. For each row, it returns the number of rows that precede or tie with the current row in the window's sort order, divided by the total number of rows in the partition.

Here’s an example to demonstrate the usage of the cume_dist function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import cume_dist
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("CumeDistExample").getOrCreate()

# Create a DataFrame with sample data
data = [("Mike Jack", 30), ("King Elene", 40), ("Barry Tim", 25), ("Yang Jakie", 35), ("Joby John", 20)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define a window over all rows, ordered by Age in ascending order.
# The window's ordering drives the result, so no separate sort is needed.
window_spec = Window.orderBy("Age")

# Compute the cumulative distribution of the Age column and display it
df.select(cume_dist().over(window_spec).alias("cum_dist")).show()

Output

+--------+
|cum_dist|
+--------+
|     0.2|
|     0.4|
|     0.6|
|     0.8|
|     1.0|
+--------+

In this example, the cumulative distribution of the Age column is calculated over the rows in ascending order of Age. The first row (Age 20) has a cumulative distribution of 0.2 because one of the five rows has an Age less than or equal to 20. The last row (Age 40) has a cumulative distribution of 1.0, which indicates that 100% of the rows have an Age less than or equal to 40.
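The sample ages are all distinct, so each row adds exactly 1/5. When the ordering column contains ties, tied rows count as peers and share the same value. The sketch below emulates that behavior in plain Python so it can be checked without a Spark session. It is an illustration of the semantics, not Spark's implementation, and the helper name `cume_dist_emulated` and the tied input are assumptions.

```python
from bisect import bisect_right


def cume_dist_emulated(values):
    """Emulate cume_dist over a single partition ordered ascending.

    For each value, count the rows that are <= that value (rows before it
    plus its ties) and divide by the total number of rows.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right gives the number of elements <= v in the sorted list
    return [(v, bisect_right(ordered, v) / n) for v in ordered]


# Hypothetical input with a tie: two rows with Age 25
for age, cd in cume_dist_emulated([20, 25, 25, 30, 40]):
    print(age, cd)
```

Both rows with Age 25 receive 0.6, since three of the five rows have an Age less than or equal to 25, and no row receives 0.4.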
