PySpark : How do I read a parquet file in Spark


To read a Parquet file in Spark, you can use the spark.read.parquet() method, which returns a DataFrame. Here is an example of how you can use this method to read a Parquet file and display its contents:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ReadParquet").getOrCreate()

# Read the Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Show the contents of the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
You can also read a Parquet file from an HDFS directory:

df = spark.read.format("parquet").load("hdfs://path/to/directory")

You can also read a Parquet file with filtering, using the where method:

df = spark.read.parquet("freshers_path/to/freshers_in.parquet").where("column_name = 'value'")

In addition to reading a single Parquet file, you can also read a directory containing multiple Parquet files by specifying the directory path instead of a file path, like this:

df = spark.read.parquet("freshers_path/to/directory")

You can also use the schema method to specify the schema of the Parquet file:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df = spark.read.schema(schema).parquet("freshers_path/to/file.parquet")

By providing the schema, Spark will skip the expensive process of inferring the schema from the parquet file, which can be useful when working with large datasets.
