In this article you will learn what an RDD is, how to create an RDD from a Python list, what parallelize does, and how to create an RDD from a file in S3.
RDD: An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of data elements, partitioned across the nodes of a cluster.
Parallelize: A parallelized collection is created by calling the SparkContext parallelize method on a collection in the driver program. When parallelize is called, the elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
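To make the idea of "copying a collection into a distributed dataset" more concrete, here is a small pure-Python sketch. It is not Spark code and not how Spark is implemented internally: the helper split_into_partitions and the contiguous slicing scheme are assumptions made only for illustration. It copies a driver-side list into three chunks, counts each chunk independently (as separate tasks would on a cluster), and merges the partial results. In PySpark itself, the number of partitions can be requested through the optional numSlices argument of parallelize.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Hypothetical helper: copy a list into contiguous chunks ("partitions")."""
    n = len(items)
    return [
        list(items[(i * n) // num_partitions:((i + 1) * n) // num_partitions])
        for i in range(num_partitions)
    ]


sample_data = ["INDIA", "USA", "CANADA", "INDIA", "USA", "JAPAN", "UK", "UAE", "INDIA"]

# Copy the driver-side list into 3 partitions; the original list is left untouched.
partitions = split_into_partitions(sample_data, 3)

# Count each partition independently, the way separate tasks would on a cluster.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(Counter, partitions))

# Merge the per-partition results back into a single result.
total_counts = sum(partial_counts, Counter())

print(partitions)
print(total_counts["INDIA"])
```

The point of the sketch is only that each partition can be processed without looking at the others, which is what allows the work to run in parallel.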
# Converting a list to an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
sample_data = ["INDIA", "USA", "CANADA", "INDIA", "USA", "JAPAN", "UK", "UAE", "INDIA"]
# type(sample_data) => <class 'list'>
rdd = spark.sparkContext.parallelize(sample_data)
# Converted to an RDD: type(rdd) => <class 'pyspark.rdd.RDD'>
rdd.collect()

# Reading from S3 and converting to an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
rdd_data = spark.sparkContext.textFile('s3://sem-freshers-in-spark_training/training/sample_txt.txt')
# Created an RDD from an external source: type(rdd_data) => <class 'pyspark.rdd.RDD'>
rdd_data.collect()