PySpark : Reversing the order of strings in a list using PySpark


Let's create sample data in the form of a list of strings.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate(conf)
spark = SparkSession(sc)

# Sample data
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
# Parallelize the data with Spark
rdd = sc.parallelize(data)

Now, we can apply a map operation on this RDD (Resilient Distributed Dataset, the fundamental data structure of Spark). The map operation applies a given function to each element of the RDD and returns a new RDD.

We will use the built-in Python function reversed() inside a map operation to reverse the order of each string. reversed() returns a reverse iterator, so we have to join it back into a string with ''.join().

# Apply map operation to reverse the strings
reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))

The lambda function here is a simple anonymous function that takes one argument, x, and returns the reversed string. x is each element of the RDD (each string in this case).
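An equivalent and common Python idiom is slice notation with a step of -1, which reverses a string in one expression. A small plain-Python check (no Spark needed) that the two idioms agree:

```python
# Slicing with a step of -1 reverses a string in one expression
words = ['Sachin', 'Narendra', 'Arun']

sliced = [w[::-1] for w in words]
joined = [''.join(reversed(w)) for w in words]

print(sliced)            # ['nihcaS', 'ardneraN', 'nurA']
print(sliced == joined)  # True
```

In the RDD pipeline this would read rdd.map(lambda x: x[::-1]) and produce the same result.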

After this operation, we have a new RDD where each string from the original RDD has been reversed. You can collect the results back to the driver program using the collect() action.

# Collect the results
reversed_data = reversed_rdd.collect()

# Print the reversed strings
for word in reversed_data:
    print(word)
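For the sample list used here, the result can be verified locally with plain Python, without a Spark cluster:

```python
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']

# Reverse each string exactly as the RDD map does
reversed_data = [''.join(reversed(w)) for w in data]
for word in reversed_data:
    print(word)
# nihcaS
# ardneraN
# nurA
# elcarO
# tfihsdeR
```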

As you can see, the order of characters in each string from the list has been reversed. Note that Spark operations are lazily evaluated, meaning the actual computations (like reversing the strings) only happen when an action (like collect()) is called. This feature allows Spark to optimize the overall data processing workflow.

Complete code

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate(conf)
spark = SparkSession(sc)

# Sample data for testing
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
# Parallelize the data with Spark
rdd = sc.parallelize(data)
# Reverse each string in the RDD
reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))
# Collect the results
reversed_data = reversed_rdd.collect()
# Print the reversed strings
for word in reversed_data:
    print(word)