How to remove a CSV header using Spark (PySpark)

A common use case when dealing with CSV files is to remove the header row from the source before doing data analysis. In PySpark this can be done as below.

Source Code (PySpark – Python 3.6 and Spark 3; also compatible with Spark 2.2+ and Python 2.7)

from pyspark import SparkContext
import csv

sc = SparkContext()
# Read the raw CSV file as an RDD of text lines
readFile = sc.textFile("D:\\Users\\speedika\\PycharmProjects\\sparkprojects\\sample_csv_01.csv")
# Parse the lines of each partition with the csv module
readCSV = readFile.mapPartitions(lambda x: csv.reader(x))
# Pair each row with its index, starting from 0
file_with_indx = readCSV.zipWithIndex()
for data_with_idx in file_with_indx.collect():
    print(data_with_idx)
# Keep rows with index > 0 (drop the header), then strip the index
rmHeader = file_with_indx.filter(lambda x: x[1] > 0).map(lambda x: x[0])
for cleanse_data in rmHeader.collect():
    print(cleanse_data)

Code Explanation
file_with_indx = readCSV.zipWithIndex()
The zipWithIndex() transformation pairs each element of the RDD with its index. Each row in the CSV gets an index attached, starting from 0.
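The pairing can be illustrated with a pure-Python sketch (enumerate plays the role of zipWithIndex here, with the order inside each pair swapped to match Spark's (element, index) layout):

```python
# Pure-Python sketch of what zipWithIndex() produces on the sample data:
# each element is paired with its position, index in second place.
rows = [['Name', 'Country', 'Phone'],
        ['TOM', 'USA', '343-098-292'],
        ['JACK', 'CHINA', '783-098-232']]
indexed = [(row, idx) for idx, row in enumerate(rows)]
print(indexed[0])  # (['Name', 'Country', 'Phone'], 0)
```

Note that the header row ends up with index 0, which is what the filter in the next step relies on.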
rmHeader = file_with_indx.filter(lambda x : x[1] > 0).map(lambda x : x[0])
This keeps only the rows whose index is greater than 0, removing the header row (index 0), and then strips the index with map(). If you want to skip the first 'n' rows instead, the same pattern works with the condition adjusted accordingly.
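Generalizing the filter to skip the first n rows can be sketched in plain Python on (row, index) pairs like those zipWithIndex() produces (skip_first_n is a hypothetical helper, not part of the original code):

```python
def skip_first_n(indexed_rows, n):
    """Keep rows whose index is >= n, dropping the first n rows."""
    return [row for row, idx in indexed_rows if idx >= n]

pairs = [(['Name', 'Country', 'Phone'], 0),
         (['TOM', 'USA', '343-098-292'], 1),
         (['JACK', 'CHINA', '783-098-232'], 2)]

# n = 1 drops only the header row
print(skip_first_n(pairs, 1))
```

In Spark itself the equivalent would be file_with_indx.filter(lambda x: x[1] >= n).map(lambda x: x[0]).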

Note: The print statements are used here only to demonstrate the functionality; in a real job you would continue transforming the RDD rather than collecting it.

Sample data 
Name,Country,Phone
TOM,USA,343-098-292
JACK,CHINA,783-098-232
CHARLIE,INDIA,873-984-123
SUSAN,JAPAN,898-231-987
MIKE,UK,987-989-121

Result
['TOM', 'USA', '343-098-292']
['JACK', 'CHINA', '783-098-232']
['CHARLIE', 'INDIA', '873-984-123']
['SUSAN', 'JAPAN', '898-231-987']
['MIKE', 'UK', '987-989-121']

Reference documentation: zipWithIndex()
