PySpark : Sort an array of elements in a DataFrame column


The array_sort function is a PySpark function that sorts the elements of an array column in ascending order, placing null elements at the end. It takes an array column (a column name or a Column expression) and returns a sorted array column, which you typically use inside select or withColumn to produce a new DataFrame.

Here’s an example of how to use the array_sort function:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_sort
# Create a SparkSession
spark = SparkSession.builder.appName("PySparkArraySort").getOrCreate()
# Create a DataFrame with sample data
data = [("Twinkle Baby", [3, 2, 1]), ("Barry Tim", [5, 4, 3]), ("King Mandate", [7, 6, 5])]
df = spark.createDataFrame(data, ["name", "numbers"])
# Sort the "numbers" column using array_sort
sorted_df = df.select("name", array_sort("numbers").alias("numbers"))
# Show the sorted DataFrame
sorted_df.show()

In this example, we create a DataFrame with two columns: “name” and “numbers”. The “numbers” column is an array of integers. We then use the array_sort function inside select to sort the “numbers” column in ascending order. The select call returns a new DataFrame with the sorted column, which we assign to the variable sorted_df. The original DataFrame remains unchanged.

Sorted output:

+------------+---------+
|        name|  numbers|
+------------+---------+
|Twinkle Baby|[1, 2, 3]|
|   Barry Tim|[3, 4, 5]|
|King Mandate|[5, 6, 7]|
+------------+---------+

As you can see, the “numbers” column is now sorted in ascending order. Keep in mind that this function only sorts the array, it doesn’t affect any other columns in the DataFrame.

Important Spark URLs to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page