PySpark: How to create an RDD from a List and from AWS S3

PySpark @ Freshers.in

In this article you will learn what an RDD is, how to create an RDD from a Python list, what parallelize does, and how to create an RDD from a file in S3.

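RDD : An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of elements of your data, partitioned across the nodes of the cluster. Because an RDD is immutable, a transformation returns a new RDD rather than modifying the existing one.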

Parallelize : A parallelized collection is created by calling the parallelize method of SparkContext on a collection in the driver program. When parallelize is called, the elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

# Converting a List to an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
sample_data = ["INDIA", "USA", "CANADA", "INDIA", "USA", "JAPAN", "UK", "UAE", "INDIA"]
# type(sample_data) => <class 'list'>
rdd = spark.sparkContext.parallelize(sample_data)
# Converted to an RDD: type(rdd) => <class 'pyspark.rdd.RDD'>
# collect() returns all elements of the RDD to the driver as a Python list
print(rdd.collect())

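# Reading a file from S3 and converting it to an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Freshers_in").getOrCreate()
# The s3:// scheme is resolved on Amazon EMR; open-source Spark with hadoop-aws typically uses s3a://
rdd_data = spark.sparkContext.textFile('s3://sem-freshers-in-spark_training/training/sample_txt.txt')
# RDD created from an external source: type(rdd_data) => <class 'pyspark.rdd.RDD'>
# Each line of the file becomes one element; collect() returns all of them to the driver
print(rdd_data.collect())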


Author: user
