PySpark: Reading from multiple files — how to find which file each record came from [input_file_name]

PySpark @ Freshers.in

pyspark.sql.functions.input_file_name

One of the most useful features of PySpark is the ability to access metadata about the input files being processed in a job. This metadata can be used to perform a variety of tasks, including filtering data based on file name, partitioning data based on file location, and more.

The input_file_name function is a built-in PySpark function that returns, for each record, the name of the file that record was read from. It has been available in pyspark.sql.functions since Spark 1.6 and can be used to extract information about the input file behind every row.

To use input_file_name, you first need to import it from the PySpark SQL functions module. Here’s an example:

from pyspark.sql.functions import input_file_name

Once you’ve imported input_file_name, you can use it in a PySpark DataFrame transformation to extract information about the input file. Here’s an example:

from pyspark.sql.functions import input_file_name
# create a DataFrame from a CSV file
df = spark.read.format('csv').load('D:/Learning/PySpark/infiles/*')
df.show(20,False)

Result

+---+-----+---+----+
|_c0|  _c1|_c2| _c3|
+---+-----+---+----+
|301| John| 30|3000|
|302|Jibin| 31|3050|
|303|Jerry| 32|3075|
|101|  Sam| 10|1000|
|102|Peter| 11|1050|
|103| Eric| 12|1075|
|201|Albin| 20|2000|
|202| Eldo| 21|2050|
|203|  Joy| 22|2075|
+---+-----+---+----+

Implementing input_file_name()

# add a new column to the DataFrame containing the file name
df = df.withColumn('input_file', input_file_name())
df.show(20,False)

Output

+---+-----+---+----+--------------------------------------------------+
|_c0|_c1  |_c2|_c3 |input_file                                        |
+---+-----+---+----+--------------------------------------------------+
|301|John |30 |3000|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|302|Jibin|31 |3050|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|303|Jerry|32 |3075|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|101|Sam  |10 |1000|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|102|Peter|11 |1050|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|103|Eric |12 |1075|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|201|Albin|20 |2000|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
|202|Eldo |21 |2050|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
|203|Joy  |22 |2075|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
+---+-----+---+----+--------------------------------------------------+
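Notice that input_file_name() returns a full file:// URI, while you often only want the base file name (here, the date the file was delivered). The string manipulation involved can be sketched in plain Python as follows (the helper name is ours, not part of PySpark):

```python
import os
from urllib.parse import urlparse

def file_date(uri: str) -> str:
    """Return the file name without its extension from a file:// URI.
    For the files above, that is the delivery date, e.g. '21-03-2023'."""
    path = urlparse(uri).path  # '/D:/Learning/PySpark/infiles/21-03-2023.csv'
    return os.path.splitext(os.path.basename(path))[0]

print(file_date("file:///D:/Learning/PySpark/infiles/21-03-2023.csv"))  # 21-03-2023
```

Inside PySpark itself you could derive the same value per row with a regular expression on the column, e.g. regexp_extract(input_file_name(), r'([^/]+)\.csv$', 1).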
