PySpark: Understanding the map_from_arrays Function with Detailed Examples


PySpark provides a wide range of functions to manipulate and transform data within DataFrames. In this article, we will focus on the map_from_arrays function, which creates a map column by combining two array columns. We will cover what the function does, its syntax, and a detailed example with input data to illustrate its usage.

  1. The map_from_arrays Function in PySpark

The map_from_arrays function is part of the pyspark.sql.functions module and has been available since Spark 2.4. For each row, it builds a map by pairing the elements of two array columns by position: the first array supplies the keys and the second array supplies the values. The two arrays in a row must have the same length, and no key may be null. The resulting map column is useful for representing key-value pairs in a compact format.
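As a rough mental model, the per-row pairing resembles zipping two Python lists into a dictionary. The sketch below is only a plain-Python analogy and does not use Spark; the variable names are taken from this article's sample data for illustration.

```python
# Plain-Python analogy (no Spark involved): pair each key with the value
# found at the same position, as map_from_arrays does for one row.
keys = ["a", "b", "c"]
values = [1, 2, 3]

row_map = dict(zip(keys, values))
print(row_map)  # {'a': 1, 'b': 2, 'c': 3}
```

The analogy is not exact: zip silently truncates to the shorter list, whereas Spark is expected to reject key and value arrays of different lengths.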


pyspark.sql.functions.map_from_arrays(col1, col2)
col1: A column (or column name) holding the array of map keys. No element may be null.
col2: A column (or column name) holding the array of map values.

  2. A Detailed Example of Using the map_from_arrays Function

Let’s create a PySpark DataFrame with two array columns, representing keys and values, and apply the map_from_arrays function to combine them into a map column.

First, let’s import the necessary libraries and create a sample DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import map_from_arrays

# Create a Spark session
spark = SparkSession.builder.master("local").appName("map_from_arrays Function Example").getOrCreate()

# Sample data: each row holds an array of keys and an array of values
data = [(["a", "b", "c"], [1, 2, 3]), (["x", "y", "z"], [4, 5, 6])]

# Column names (the element types are inferred from the data)
schema = ["Keys", "Values"]

# Create the DataFrame
df = spark.createDataFrame(data, schema)

Now that we have our DataFrame, let’s apply the map_from_arrays function to it:

# Apply the map_from_arrays function
df = df.withColumn("Map", map_from_arrays(df["Keys"], df["Values"]))
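# Show the results without truncating the column contents
df.show(truncate=False)

Output:

+---------+---------+------------------------+
|Keys     |Values   |Map                     |
+---------+---------+------------------------+
|[a, b, c]|[1, 2, 3]|{a -> 1, b -> 2, c -> 3}|
|[x, y, z]|[4, 5, 6]|{x -> 4, y -> 5, z -> 6}|
+---------+---------+------------------------+

In this example, we created a PySpark DataFrame with two array columns, “Keys” and “Values”, and applied the map_from_arrays function to combine them into a “Map” column. The output DataFrame displays the original keys and values arrays alongside the resulting map column, in which each key is paired with the value at the same position.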

The PySpark map_from_arrays function is a convenient way to turn a pair of array columns into a single map column of key-value pairs. With the example in this article as a starting point, you should be able to apply map_from_arrays in your own PySpark projects.

Author: user
