PySpark : PySpark to extract specific fields from XML data

user April 3, 2023 Leave a Comment

XML data is commonly used in data exchange and storage, and it can contain complex hierarchical structures. PySpark provides a simple and efficient way to extract specific fields from XML data using its built-in functions.

Sample XML data

<?xml version="1.0" encoding="UTF-8"?>
<People>
    <Person>
        <Name>John</Name>
        <Age>30</Age>
        <City>New York</City>
    </Person>
    <Person>
        <Name>Jane</Name>
        <Age>35</Age>
        <City>London</City>
    </Person>
    <Person>
        <Name>Jack</Name>
        <Age>25</Age>
        <City>Paris</City>
    </Person>
    <Person>
        <Name>Emily</Name>
        <Age>40</Age>
        <City>Tokyo</City>
    </Person>
</People>

Input Data
Let’s assume we have the following dataset that contains XML data in a column:

+----+-----------------------+
| ID |          XML          |
+----+-----------------------+
|  1 |<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>|
|  2 |<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>|
|  3 |<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>|
|  4 |<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>|
+----+-----------------------+

Extracting Specific Fields from XML Data in PySpark

To extract specific fields from XML data in PySpark, we can use the xpath function. The xpath function evaluates an XPath expression against the XML data and returns the result as a string, an array, or a struct.

For example, to extract the Name and Age fields from the XML data in the input DataFrame, we can use the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import xpath

# create a SparkSession
spark = SparkSession.builder.appName("ExtractXMLFields").getOrCreate()

# load the input data into a DataFrame
df = spark.createDataFrame([
    (1, "<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>"),
    (2, "<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>"),
    (3, "<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>"),
    (4, "<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>")
], ["ID", "XML"])

# extract the Name and Age fields from the XML column
df_xml = df.select("ID", xpath("XML", "/Person/Name/text()").alias("Name"), xpath("XML", "/Person/Age/text()").alias("Age"))

# show the result
df_xml.show()

Output

+---+----+---+
| ID|Name|Age|
+---+----+---+
|  1|John| 30|
|  2|Jane| 35|
|  3|Jack| 25|
|  4|Emily| 40|
+---+----+---+

As we can see, the output DataFrame contains the Name and Age fields extracted from the XML data in the input DataFrame.

Extracting specific fields from XML data in PySpark is a simple and efficient process using the xpath function. By specifying the XPath expression that matches the desired fields, we can easily extract the specific fields from the XML data. This is an essential step in data preprocessing and cleaning that facilitates the analysis and modeling of complex hierarchical data structures.

Sample code to read XML data from a file using PySpark

from pyspark.sql import SparkSession
# create a SparkSession
spark = SparkSession.builder.appName("ReadXML").getOrCreate()
# read the XML file into a DataFrame
df = spark.read.format("xml").option("rowTag", "Person").load("path/to/xml/file.xml")
# show the result
df.show()

Spark important urls to refer

Post Views: 575

Author: user

PySpark : PySpark to extract specific fields from XML data

Extracting Specific Fields from XML Data in PySpark

Sample code to read XML data from a file using PySpark

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Troubleshooting Data Ingestion and Processing Issues with AWS Kinesis Streams

Impact of Shard Count Modification on AWS Kinesis Streams

How to map values of a Series according to an input correspondence:SSeries.map()

Understanding Series.transform(func[, axis])

Series.aggregate(func) : Pandas API on Spark

Series.agg(func) : Pandas API on Spark

Most Viewed Posts

Extracting Specific Fields from XML Data in PySpark

Sample code to read XML data from a file using PySpark

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget