PySpark : How do I read a Parquet file in Spark?


To read a Parquet file in Spark, use the spark.read.parquet() method, which returns a DataFrame. Here is an example that reads a Parquet file and displays its contents:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ReadParquet").getOrCreate()

# Read the Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Show the contents of the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
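
If you do not already have a Parquet file to test with, you can generate one first (before calling spark.stop()). The sketch below writes a small DataFrame out as Parquet so the example above has something to read; the path and the name/age columns are placeholders, not part of the original example:

# Build a small DataFrame and write it out as Parquet
# (the path is a placeholder)
sample = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25)],
    ["name", "age"]
)
sample.write.mode("overwrite").parquet("path/to/file.parquet")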

You can also read a Parquet file from an HDFS directory using the format/load API:

df = spark.read.format("parquet").load("hdfs://path/to/directory")
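
Note that on a real cluster the HDFS URI usually carries the NameNode host and port as well; the host and port below are hypothetical placeholders:

# Fully qualified HDFS URI (the NameNode host "namenode" and port 8020 are hypothetical)
df = spark.read.format("parquet").load("hdfs://namenode:8020/path/to/directory")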

You can also read a Parquet file and filter its rows by chaining the where method:

df = spark.read.parquet("freshers_path/to/freshers_in.parquet").where("column_name = 'value'")

In addition to reading a single Parquet file, you can also read a directory containing multiple Parquet files by specifying the directory path instead of a file path, like this:

df = spark.read.parquet("freshers_path/to/directory")

You can also use the schema method to specify the schema of the Parquet file explicitly:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df = spark.read.schema(schema).parquet("freshers_path/to/file.parquet")
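
As an alternative to building a StructType by hand, the schema method also accepts a DDL-formatted string, which is often more concise (same hypothetical name/age columns as above):

# Equivalent schema expressed as a DDL string
df = spark.read.schema("name STRING, age INT").parquet("freshers_path/to/file.parquet")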

By providing the schema up front, Spark can skip inferring it from the Parquet file footers, which can be useful when working with large datasets spread across many files.
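
You can verify which schema Spark applied with printSchema, which should produce output along these lines for the schema above:

# Print the DataFrame schema in tree form
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)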
