PySpark : PySpark to extract specific fields from XML data

PySpark @ Freshers.in

XML data is commonly used in data exchange and storage, and it can contain complex hierarchical structures. PySpark provides a simple and efficient way to extract specific fields from XML data using its built-in functions.

Sample XML data

<?xml version="1.0" encoding="UTF-8"?>
<People>
    <Person>
        <Name>John</Name>
        <Age>30</Age>
        <City>New York</City>
    </Person>
    <Person>
        <Name>Jane</Name>
        <Age>35</Age>
        <City>London</City>
    </Person>
    <Person>
        <Name>Jack</Name>
        <Age>25</Age>
        <City>Paris</City>
    </Person>
    <Person>
        <Name>Emily</Name>
        <Age>40</Age>
        <City>Tokyo</City>
    </Person>
</People>

Input Data
Let’s assume we have the following dataset that contains XML data in a column:

+----+-----------------------+
| ID |          XML          |
+----+-----------------------+
|  1 |<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>|
|  2 |<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>|
|  3 |<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>|
|  4 |<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>|
+----+-----------------------+

Extracting Specific Fields from XML Data in PySpark

To extract specific fields from XML data in PySpark, we can use the xpath function. The xpath function evaluates an XPath expression against the XML data and returns the result as a string, an array, or a struct.

For example, to extract the Name and Age fields from the XML data in the input DataFrame, we can use the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import xpath

# create a SparkSession
spark = SparkSession.builder.appName("ExtractXMLFields").getOrCreate()

# load the input data into a DataFrame
df = spark.createDataFrame([
    (1, "<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>"),
    (2, "<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>"),
    (3, "<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>"),
    (4, "<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>")
], ["ID", "XML"])

# extract the Name and Age fields from the XML column
df_xml = df.select("ID", xpath("XML", "/Person/Name/text()").alias("Name"), xpath("XML", "/Person/Age/text()").alias("Age"))

# show the result
df_xml.show()

Output

+---+----+---+
| ID|Name|Age|
+---+----+---+
|  1|John| 30|
|  2|Jane| 35|
|  3|Jack| 25|
|  4|Emily| 40|
+---+----+---+

As we can see, the output DataFrame contains the Name and Age fields extracted from the XML data in the input DataFrame.

Extracting specific fields from XML data in PySpark is a simple and efficient process using the xpath function. By specifying the XPath expression that matches the desired fields, we can easily extract the specific fields from the XML data. This is an essential step in data preprocessing and cleaning that facilitates the analysis and modeling of complex hierarchical data structures.

Sample code to read XML data from a file using PySpark

from pyspark.sql import SparkSession
# create a SparkSession
spark = SparkSession.builder.appName("ReadXML").getOrCreate()
# read the XML file into a DataFrame
df = spark.read.format("xml").option("rowTag", "Person").load("path/to/xml/file.xml")
# show the result
df.show()

Spark important urls to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page
Author: user

Leave a Reply