Here we will see how to read a sample text file as an RDD using Spark.
Environment and versions used here:
Spark : 3.0.3
Python : version 3.8.10
Java : 11.0.13 2021-10-19 LTS
My OS : Windows 10 Pro
Use case : Read data from a local file and print it to the console
My Local data set : D:\\Learning\\PySpark\\SourceCode\\sample_data.txt
from pyspark import SparkContext

# Get or create a SparkContext, then read the local text file as an RDD
sc = SparkContext.getOrCreate()
textFile = sc.textFile("D:\\Learning\\PySpark\\SourceCode\\sample_data.txt")

# Bring all lines to the driver and print them
print(textFile.collect())
getOrCreate : Returns the existing SparkContext if one is already running; otherwise it creates a new one. Here we are not passing any options.
PySpark collect() : collect() is an action on an RDD or DataFrame that retrieves the data. It gathers all the elements from every partition of the RDD and brings them to the driver node/program.