How to remove a CSV header using Spark (PySpark)

A common use case when dealing with CSV files is to remove the header row from the source before doing data analysis. In PySpark this can be done as shown below.

Source Code (PySpark – Python 3.6 and Spark 3; this is also compatible with Spark 2.2+ and Python 2.7)

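import csv
from pyspark import SparkContext

sc = SparkContext()
# Read the CSV file as an RDD of text lines
readFile = sc.textFile("D:\\Users\\speedika\\PycharmProjects\\sparkprojects\\sample_csv_01.csv")
# Parse the lines of each partition into lists of fields
readCSV = readFile.mapPartitions(lambda x: csv.reader(x))
# Pair every row with its index as (row, index), starting from 0
file_with_indx = readCSV.zipWithIndex()
for data_with_idx in file_with_indx.collect():
    print(data_with_idx)
# Keep the rows whose index is greater than 0 (drops the header), then strip the index
rmHeader = file_with_indx.filter(lambda x: x[1] > 0).map(lambda x: x[0])
for cleanse_data in rmHeader.collect():
    print(cleanse_data)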

Code Explanation
file_with_indx = readCSV.zipWithIndex()
The zipWithIndex() transformation pairs each element of the RDD with its index, producing (row, index) tuples. Each row in the CSV gets an index attached, starting from 0, so the header row has index 0.
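To make the shape of these pairs concrete, here is a plain-Python analogy that needs no Spark session. It only imitates the (row, index) layout on a small list; the header row is an assumption for illustration, since the sample data does not show one.

```python
# Plain-Python analogy of the pairs produced by RDD.zipWithIndex(); no Spark needed.
rows = [
    ["NAME", "COUNTRY", "PHONE"],        # hypothetical header, not in the original data
    ["TOM", "USA", "343-098-292"],
    ["JACK", "CHINA", "783-098-232"],
]

# zipWithIndex() yields (element, index), the reverse of the
# (index, element) order that enumerate() produces.
rows_with_index = [(row, idx) for idx, row in enumerate(rows)]

for pair in rows_with_index:
    print(pair)
```

The header ends up as the pair with index 0, which is what the filter step relies on.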
rmHeader = file_with_indx.filter(lambda x : x[1] > 0).map(lambda x : x[0])
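This keeps only the rows with an index greater than 0, which removes the header row (index 0); the map() call then drops the index again. If you want to skip the first ‘n’ rows, you can use the same code with the condition changed to x[1] >= n.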
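The "skip n rows" variant could be sketched as follows. The names keep_from and skip_rows are hypothetical helpers, not part of PySpark, and skip_rows assumes it is handed an object with zipWithIndex(), filter() and map(), as a PySpark RDD has.

```python
def keep_from(n):
    """Build a predicate that keeps (row, index) pairs whose index is >= n."""
    return lambda pair: pair[1] >= n


def skip_rows(rdd, n):
    """Drop the first n elements of an RDD (hypothetical helper).

    Assumes `rdd` offers zipWithIndex(), filter() and map(), as PySpark RDDs do.
    """
    indexed = rdd.zipWithIndex()             # (row, index) pairs, index starts at 0
    kept = indexed.filter(keep_from(n))      # pairs with index 0 .. n-1 are dropped
    return kept.map(lambda pair: pair[0])    # strip the index again


# Quick check of the predicate on plain tuples, no Spark required
pairs = [("header", 0), ("comment", 1), ("row-1", 2), ("row-2", 3)]
remaining = [row for row, idx in pairs if keep_from(2)((row, idx))]
print(remaining)  # ['row-1', 'row-2']
```

With the readCSV RDD from the source code above, skip_rows(readCSV, 1) should give the same rows as rmHeader.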

Note: The print statements are used here only to show the functionality.

Sample data 

['TOM', 'USA', '343-098-292']
['JACK', 'CHINA', '783-098-232']
['CHARLIE', 'INDIA', '873-984-123']
['SUSAN', 'JAPAN', '898-231-987']
['MIKE', 'UK', '987-989-121']

Reference documentation: zipWithIndex()


Author: user
