Dividing an ordered dataset into a specified number of approximately equal segments using PySpark

The ntile function in PySpark divides an ordered dataset into a specified number of approximately equal segments, or “tiles”. It is particularly useful for percentile calculations, data stratification, and splitting a dataset into quantiles, which makes it invaluable for data analysts and scientists looking to segment data effectively. This article demystifies the ntile function with a comprehensive guide, bolstered by a practical example.

Syntax:

from pyspark.sql.window import Window
from pyspark.sql.functions import ntile
windowSpec = Window.orderBy("column_to_order")
df.withColumn("tile_column", ntile(number_of_tiles).over(windowSpec))


Let’s consider an example where we have a dataset of employees with their respective salaries. We aim to segment this data into 4 quartiles based on their salary.

Sample data

Suppose we have the following data in a DataFrame named employee_df:

Name    Salary
Sachin  70000
Manju   80000
Ram     55000
Raju    65000
David   72000
Wilson  60000

Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window
from pyspark.sql.types import *
# Initialize Spark Session
spark = SparkSession.builder.appName("NtileExample").getOrCreate()
# Sample data
data = [("Sachin", 70000),
        ("Manju", 80000),
        ("Ram", 55000),
        ("Raju", 65000),
        ("David", 72000),
        ("Wilson", 60000)]
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
# Create DataFrame
employee_df = spark.createDataFrame(data, schema)
# Define window specification; without partitionBy, Spark moves all rows
# into a single partition, which is acceptable for small datasets
windowSpec = Window.orderBy(employee_df["Salary"])
# Apply ntile function
employee_df_with_quartiles = employee_df.withColumn("Quartile", ntile(4).over(windowSpec))
# Show results
employee_df_with_quartiles.show()

Output

+------+------+--------+
|  Name|Salary|Quartile|
+------+------+--------+
|   Ram| 55000|       1|
|Wilson| 60000|       1|
|  Raju| 65000|       2|
|Sachin| 70000|       2|
| David| 72000|       3|
| Manju| 80000|       4|
+------+------+--------+


The output displays the original data along with a new column, Quartile, indicating the quartile to which each employee belongs based on their salary. Because 6 rows cannot be divided evenly into 4 tiles, the first two tiles receive two rows each and the last two receive one each, keeping the segments as equal as possible.
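The uneven split above follows the standard SQL NTILE rule: with n rows and k tiles, the first n mod k tiles each get one extra row. A minimal pure-Python sketch of that assignment rule (the semantics only, not PySpark's implementation):

```python
def ntile_assign(n_rows, n_tiles):
    """Return the 1-based tile number for each of n_rows ordered rows,
    mirroring SQL NTILE semantics: the first (n_rows % n_tiles) tiles
    receive one extra row when n_rows is not evenly divisible."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles

print(ntile_assign(6, 4))  # 6 rows into 4 tiles -> [1, 1, 2, 2, 3, 4]
```

The result matches the Quartile column in the output: tiles 1 and 2 hold two employees each, tiles 3 and 4 one each.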
