PySpark : Reading parquet file stored on Amazon S3 using PySpark

To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder \
        .appName("Read S3 Parquet file") \
        .getOrCreate()

# set S3 credentials if necessary
spark.conf.set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")

# read the Parquet file from S3
df = spark.read.parquet("s3a://freshers_bkt/training/view_country/parquet_file")

# show the data
df.show()
If your instance already has S3 access configured (for example, through an IAM role), you can remove the lines starting with spark.conf.set and read the file directly. Just make sure you read the path with the s3a:// scheme rather than s3://.
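Because the s3a:// scheme matters, one option is to normalize paths before handing them to Spark. The sketch below does this with a small helper; the name to_s3a is hypothetical and not part of PySpark:

```python
def to_s3a(path: str) -> str:
    """Rewrite s3:// or s3n:// URIs to the s3a:// scheme used by the
    Hadoop S3A connector. Hypothetical helper, not part of PySpark."""
    for prefix in ("s3://", "s3n://"):
        if path.startswith(prefix):
            return "s3a://" + path[len(prefix):]
    return path  # already s3a:// (or a local path) - leave unchanged

# Usage with the read shown above (requires a live SparkSession):
# df = spark.read.parquet(to_s3a("s3://freshers_bkt/training/view_country/parquet_file"))
```

This keeps the scheme fix in one place instead of relying on every caller to remember it.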

In this code, you first create a SparkSession. Then you set the S3 credentials, if necessary, using the spark.conf.set() method. Next, you read the Parquet file from S3 by calling spark.read.parquet() and passing the S3 path of the file as an argument. Once the file has been read, you can call the show() method on the resulting DataFrame to display the data.
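Instead of calling spark.conf.set() after the session exists, the same Hadoop S3A settings can be supplied while building the session via .config(). The sketch below first collects them in a plain dictionary; the helper name s3a_conf and the placeholder credentials are assumptions for illustration:

```python
def s3a_conf(access_key: str, secret_key: str) -> dict:
    """Collect the Hadoop S3A credential settings in one place
    (hypothetical helper for illustration)."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }

conf = s3a_conf("ACCESS_KEY", "SECRET_KEY")

# Applying the settings at build time (requires a PySpark installation):
# builder = SparkSession.builder.appName("Read S3 Parquet file")
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

Setting the keys at build time ensures they are in place before the first S3 access the session makes.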

Spark important urls to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page