AWS Glue: Example of how to read a sample CSV file with PySpark

user December 28, 2021

Reading a sample CSV file using PySpark

This example assumes that your CSV data is in an AWS S3 bucket. The next step is to crawl the data in that bucket with an AWS Glue crawler. Once the crawl is done, you will find that the crawler has created a metadata table for your CSV data in the Glue Data Catalog.
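The table created by the crawler can be read through the Data Catalog. The following is a minimal sketch of that route, not part of the original script: the helper name read_catalog_table and the database and table names in the usage comment are hypothetical, and you would substitute the names your crawler actually produced.

```python
def read_catalog_table(glue_context, database, table_name):
    """Load a Glue Data Catalog table and return it as a Spark DataFrame.

    glue_context is expected to be an awsglue.context.GlueContext.
    """
    # from_catalog returns a Glue DynamicFrame for the catalog table
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )
    # toDF() converts the DynamicFrame into a regular Spark DataFrame
    return dynamic_frame.toDF()


# Inside a Glue job (database and table names are assumptions):
# students = read_catalog_table(glueContext, "freshers_db", "final_year_csv")
# students.printSchema()
```

The script in this post takes the other route and reads the CSV file directly from its S3 path.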

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Glue context and take the SparkSession from it
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the CSV file: the first row is the header, column types are inferred
freshers_data = spark.read.format("csv").option(
    "header", "true").option(
    "inferSchema", "true").load(
    's3://freshers_in_datasets/training/students/final_year.csv')
freshers_data.printSchema()

Result

root
|-- Freshers def: string (nullable = true)
|-- student Id: string (nullable = true)
|-- student Name: string (nullable = true)
|-- student Street Address: string (nullable = true)
|-- student City: string (nullable = true)
|-- student State: string (nullable = true)
|-- student Zip Code: integer (nullable = true)
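The column names in this schema contain spaces, which can be awkward to reference in Spark SQL. As a rough sketch (the helper normalize_column_name is hypothetical and not part of the original script), the names could be converted to snake_case like this:

```python
import re


def normalize_column_name(name):
    """Convert a header such as 'student Zip Code' to 'student_zip_code'."""
    # Replace every run of non-alphanumeric characters with one underscore
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop leading/trailing underscores and lowercase the result
    return cleaned.strip("_").lower()


# Column names copied from the printSchema() output above
columns = [
    "Freshers def",
    "student Id",
    "student Name",
    "student Street Address",
    "student City",
    "student State",
    "student Zip Code",
]
renamed = [normalize_column_name(c) for c in columns]
print(renamed)
```

In the Glue job, the new names could then be applied with freshers_data.toDF(*renamed), which returns a DataFrame with the columns renamed in order.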

