Dividing an ordered dataset into a specified number of approximately equal segments using PySpark

The ntile function in PySpark divides an ordered dataset into a specified number of approximately equal segments, or “tiles”. It is particularly useful for percentile calculations, data stratification, and splitting a dataset into quantiles, which makes it invaluable for data analysts and scientists looking to segment data effectively. This article demystifies the ntile function with a comprehensive guide, bolstered by a practical example.

Syntax:

from pyspark.sql.window import Window
from pyspark.sql.functions import ntile
windowSpec = Window.orderBy("column_to_order")
df.withColumn("tile_column", ntile(number_of_tiles).over(windowSpec))


Let’s consider an example where we have a dataset of employees with their respective salaries. We aim to segment this data into 4 quartiles based on their salary.

Sample data

Suppose we have the following data in a DataFrame named employee_df:

Name    Salary
Sachin  70000
Manju   80000
Ram     55000
Raju    65000
David   72000
Wilson  60000

Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window
from pyspark.sql.types import *
# Initialize Spark Session
spark = SparkSession.builder.appName("NtileExample").getOrCreate()
# Sample data
data = [("Sachin", 70000),
        ("Manju", 80000),
        ("Ram", 55000),
        ("Raju", 65000),
        ("David", 72000),
        ("Wilson", 60000)]
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
# Create DataFrame
employee_df = spark.createDataFrame(data, schema)
# Define window specification; without partitionBy, Spark moves all rows
# into a single partition, which is acceptable for small datasets
windowSpec = Window.orderBy(employee_df["Salary"])
# Apply ntile function
employee_df_with_quartiles = employee_df.withColumn("Quartile", ntile(4).over(windowSpec))
# Show results
employee_df_with_quartiles.show()

Output

+------+------+--------+
|  Name|Salary|Quartile|
+------+------+--------+
|   Ram| 55000|       1|
|Wilson| 60000|       1|
|  Raju| 65000|       2|
|Sachin| 70000|       2|
| David| 72000|       3|
| Manju| 80000|       4|
+------+------+--------+


The output displays the original data along with a new column, Quartile, indicating the quartile to which each employee belongs based on their salary. Because 6 rows cannot be divided evenly into 4 tiles, the first two tiles receive two rows each and the last two receive one each, keeping the segments as equal as possible.
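The uneven split above follows the standard SQL NTILE rule: with n rows and k tiles, the first n mod k tiles each get one extra row. A minimal pure-Python sketch of that assignment rule (the semantics only, not PySpark's implementation):

```python
def ntile_assign(n_rows, n_tiles):
    """Return the 1-based tile number for each of n_rows ordered rows,
    mirroring SQL NTILE semantics: the first (n_rows % n_tiles) tiles
    receive one extra row when n_rows is not evenly divisible."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles

print(ntile_assign(6, 4))  # 6 rows into 4 tiles -> [1, 1, 2, 2, 3, 4]
```

The result matches the Quartile column in the output: tiles 1 and 2 hold two employees each, tiles 3 and 4 one each.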
