Dividing an ordered dataset into a specified number of approximately equal segments using PySpark

PySpark @ Freshers.in

The ntile function in PySpark divides an ordered dataset into a specified number of approximately equal segments, or “tiles”. It is a window function: each row receives a tile number from 1 to n according to its position in the window ordering. This makes it particularly useful for percentile calculations, data stratification, and splitting a dataset into quantiles. This article walks through the syntax and a practical example.

Syntax:

from pyspark.sql.window import Window
from pyspark.sql.functions import ntile
windowSpec = Window.orderBy("column_to_order")
df.withColumn("tile_column", ntile(number_of_tiles).over(windowSpec))

Let’s consider an example with a dataset of employees and their salaries. We aim to segment the employees into four quartiles by salary.

Sample data

Suppose we have the following data in a DataFrame named employee_df:

Name     Salary
Sachin   70000
Manju    80000
Ram      55000
Raju     65000
David    72000
Wilson   60000

Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark Session
spark = SparkSession.builder.appName("NtileExample").getOrCreate()
# Sample data
data = [("Sachin", 70000),
        ("Manju", 80000),
        ("Ram", 55000),
        ("Raju", 65000),
        ("David", 72000),
        ("Wilson", 60000)]
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
# Create DataFrame
employee_df = spark.createDataFrame(data, schema)
# Define Window Specification
windowSpec = Window.orderBy(employee_df["Salary"])
# Apply ntile function
employee_df_with_quartiles = employee_df.withColumn("Quartile", ntile(4).over(windowSpec))
# Show results
employee_df_with_quartiles.show()

Output

+------+------+--------+
|  Name|Salary|Quartile|
+------+------+--------+
|   Ram| 55000|       1|
|Wilson| 60000|       1|
|  Raju| 65000|       2|
|Sachin| 70000|       2|
| David| 72000|       3|
| Manju| 80000|       4|
+------+------+--------+

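The output shows the original data along with a new column, Quartile, which gives the tile each employee falls into when the rows are ordered by salary. Because six rows cannot be split evenly into four tiles, the segments are only approximately equal: the first two quartiles contain two employees each, and the last two contain one each.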
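
The tile sizes in the output (2, 2, 1, 1) follow from how NTILE spreads leftover rows: when the row count is not divisible by the number of tiles, the earliest tiles each take one extra row. The following is a rough pure-Python sketch of that assignment rule. The helper ntile_assign is hypothetical and not part of PySpark; it only mirrors the behavior on rows that are already sorted.

```python
def ntile_assign(n_rows, n_tiles):
    """Return the 1-based tile number for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        # The first `extra` tiles each take one additional row
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles

# Salaries from the example, already sorted ascending
salaries = [55000, 60000, 65000, 70000, 72000, 80000]
quartiles = ntile_assign(len(salaries), 4)
print(list(zip(salaries, quartiles)))
# [(55000, 1), (60000, 1), (65000, 2), (70000, 2), (72000, 3), (80000, 4)]
```

The result matches the Quartile column above. If there are fewer rows than tiles, each row simply gets its own tile and the higher tile numbers go unused.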

Author: user