In this article, we will explore the use of subtractByKey in PySpark, a transformation that returns the key-value pairs of one RDD whose keys do not appear in another RDD. We will walk through a detailed example using hardcoded values as input.
First, let’s create two PySpark RDDs:
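The snippet below is a minimal sketch: the SparkSession setup and the specific country/value pairs are illustrative assumptions, chosen so that rdd2 shares the keys “Botswana” and “Denmark” with rdd1, as discussed in the results section.

from pyspark.sql import SparkSession

# Start a SparkSession and grab the underlying SparkContext for RDD operations
spark = SparkSession.builder.appName("subtractByKeyExample").getOrCreate()
sc = spark.sparkContext

# Hardcoded key-value pairs (the countries and numbers here are illustrative)
rdd1 = sc.parallelize([("Botswana", 10), ("Denmark", 20), ("India", 30), ("Japan", 40)])
rdd2 = sc.parallelize([("Botswana", 100), ("Denmark", 200)])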
Using subtractByKey
Now, let’s use the subtractByKey method to create a new RDD by removing key-value pairs from rdd1 that have keys present in rdd2:
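Continuing with the illustrative RDDs defined above, a minimal sketch looks like this:

# Remove from rdd1 every pair whose key also appears in rdd2
result_rdd = rdd1.subtractByKey(rdd2)

# collect is an action that brings the resulting pairs back to the driver
result = result_rdd.collect()
print(result)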
In this example, we used the subtractByKey method on rdd1 and passed rdd2 as an argument. The method returns a new RDD containing key-value pairs from rdd1 after removing any pair with a key present in rdd2. The collect method is then used to retrieve the results.
Interpreting the Results
The resulting RDD contains the key-value pairs from rdd1 whose keys are not “Botswana” or “Denmark”, since those keys are present in rdd2.
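With the illustrative values used above, the collected output would look like the following (ordering is not guaranteed, since the pairs come from a distributed RDD):

[('India', 30), ('Japan', 40)]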
In this article, we explored the use of subtractByKey in PySpark, a transformation that returns the key-value pairs of one RDD whose keys do not appear in another RDD. We walked through a detailed example using hardcoded values as input, showing how to create two RDDs of key-value pairs, apply the subtractByKey method, and interpret the results. subtractByKey can be useful in various scenarios, such as filtering out unwanted data based on keys or performing set-like operations on key-value pair RDDs.