Version: 8.1

Parsing XML with Java Libraries

XML Document Security

Parsing XML from untrusted sources can be dangerous unless the parser is hardened, for example against XML External Entity (XXE) attacks. Best practices for preventing attackers from exploiting these vulnerabilities can be found here.

What is the DOM Parser?

The Document Object Model (DOM) parser provides a powerful way to parse and manipulate XML documents. It's commonly used due to its ease of use and comprehensive functionality. The DOM parser breaks down XML into accessible elements, each representing a node in the XML tree structure. For more information on interfacing with the DOM parser, refer to the Java XML DOM Parser Documentation.

Using the DOM Parser

The DOM parser can import XML data in several ways, depending on where the data is stored: it can read an XML file from a file path, or it can parse XML held in a string. In both cases, parsing returns a Document object, and calling getDocumentElement() on it returns the root element of the XML tree.

Jython - Reading a File
from javax.xml.parsers import DocumentBuilderFactory
from java.io import File

# Define your XML file path
xmlFilePath = "file.xml" # Replace with your actual XML file path

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML file
xmlFile = File(xmlFilePath)  # named xmlFile to avoid shadowing Python's built-in file
document = builder.parse(xmlFile)

# Access the root element
root = document.getDocumentElement()
Jython - Reading from a String
from javax.xml.parsers import DocumentBuilderFactory
from java.io import ByteArrayInputStream

# Define your XML string
xmlString = """
<employee id="1234">
<name>John Smith</name>
<start_date>2010-11-26</start_date>
<department>IT</department>
<title>Tech Support</title>
</employee>
""" # Replace with your actual XML string

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML string
stream = ByteArrayInputStream(xmlString.encode('utf-8'))
document = builder.parse(stream)

# Access the root element
root = document.getDocumentElement()

Each tag is parsed into an Element object. In the example above, the root element is the employee tag. Elements can carry attributes inside the opening tag; here, the employee element has an id attribute with a value of 1234. Elements can also contain data, such as text or child elements, between their start and end tags. All of this can be accessed through the Element object's built-in functionality.

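| Function | Description | Example | Output |
| --- | --- | --- | --- |
| Element.tagName | Returns the name of the element's tag. | print root.tagName | employee |
| Element.attributes | Returns a map (NamedNodeMap) of the element's attributes. | print root.attributes.item(0) | id: "1234" |
| Element.textContent | Returns the text contained in the element and all of its descendants. | print root.textContent | John Smith, 2010-11-26, IT, Tech Support (each value on its own line) |
| Element.item(index) | Allows direct reference to an element's child nodes by index. Whitespace between tags counts as a text node, which is why department is at index 5. item() is defined by the NodeList interface, so root.getChildNodes().item(index) is the portable form. | print root.item(5).tagName | department |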

A Simple Employee Example

Using the functions above, let's parse through a sample XML string and extract employee data. We'll demonstrate how to access different elements and attributes and display them. Let's iterate through the XML elements and print out the following employee details:

Code Output
Employee ID: 1
Name: John Doe
Department: Engineering

Employee ID: 2
Name: Jane Smith
Department: Marketing

Code Snippet - Extracting Employee Details
from javax.xml.parsers import DocumentBuilderFactory
from java.io import ByteArrayInputStream

# Define your XML string
xmlString = """
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>
""" # Replace with your actual XML string

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML string
document = builder.parse(ByteArrayInputStream(xmlString.encode('utf-8')))

# Access the root element
root = document.getDocumentElement()

# Iterate through employees
employees = root.getElementsByTagName("employee")
for i in range(employees.getLength()):
    employee = employees.item(i)
    # Convert the id attribute to an integer
    employeeId = int(employee.getAttribute("id"))
    print "Employee ID:", employeeId
    print "Name:", employee.getElementsByTagName("name").item(0).textContent
    print "Department:", employee.getElementsByTagName("department").item(0).textContent
    print

What is the SAX Parser?

The Simple API for XML (SAX) parser, available through Java libraries, provides an event-driven approach to parse XML documents. It's widely used for its efficiency, especially when handling large XML files. SAX parses XML sequentially and triggers events as it encounters elements, attributes, and other components in the XML document. For more detailed information about the SAX parser, refer to the Java XML SAX Parser Documentation.

Using the SAX Parser

The SAX parser doesn't build a tree structure like the DOM parser. Instead, it parses the XML document sequentially and triggers events that the developer can handle. Here's an example of using the SAX parser to parse an XML string:

Jython - Reading from a String
from javax.xml.parsers import SAXParserFactory
from org.xml.sax.helpers import DefaultHandler
from java.io import ByteArrayInputStream

# Define your XML string
xmlString = """
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>
""" # Replace with your actual XML string

# Define a custom ContentHandler
class MyContentHandler(DefaultHandler):
    # Called when an opening tag is encountered
    def startElement(self, uri, localName, qName, attributes):
        print "Start Element:", qName
        for i in range(attributes.getLength()):
            print "Attribute:", attributes.getQName(i), "=", attributes.getValue(i)

    # Called when a closing tag is encountered
    def endElement(self, uri, localName, qName):
        print "End Element:", qName

    # Called for text between tags, including whitespace
    def characters(self, ch, start, length):
        # ch is a Java char array; join the relevant slice into a string
        print "Character Data:", "".join(ch[start:start + length])

# Create a SAX parser
saxParserFactory = SAXParserFactory.newInstance()
saxParser = saxParserFactory.newSAXParser()

# Parse the XML string
stream = ByteArrayInputStream(xmlString.encode('utf-8'))
saxParser.parse(stream, MyContentHandler())

In the above example, we define a custom handler class, MyContentHandler, that extends DefaultHandler. This class overrides methods to handle different events encountered during XML parsing, such as starting and ending elements, and character data.

What is the StAX Parser?

The Streaming API for XML (StAX) parser, available through Java libraries, offers a cursor-based approach to parse XML documents. It provides an efficient way to read and process XML sequentially without loading the entire document into memory. StAX parsers allow developers to iterate through XML elements, attributes, and other components as they are encountered in the XML stream. For more detailed information about the StAX parser, refer to the Java XML StAX Parser Documentation.

Using the StAX Parser

The StAX parser operates in a streaming fashion, allowing developers to read XML content sequentially without the need to build a complete in-memory representation of the XML document. Here's an example of how to create an XML input factory and a stream reader to parse the XML content. We iterate through the XML stream and handle different events such as starting and ending elements, as well as character data.

Code Snippet - Reading from a String
from javax.xml.stream import XMLInputFactory, XMLStreamReader
from java.io import ByteArrayInputStream

# Define your XML string
xmlString = """
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>
""" # Replace with your actual XML string

# Create an XML input factory
inputFactory = XMLInputFactory.newInstance()

# Create an XML stream reader
streamReader = inputFactory.createXMLStreamReader(ByteArrayInputStream(xmlString.encode('utf-8')))

# Iterate through the XML stream
while streamReader.hasNext():
    event = streamReader.next()
    if event == XMLStreamReader.START_ELEMENT:
        print "Start Element:", streamReader.getLocalName()
        # Print attributes, if any
        for i in range(streamReader.getAttributeCount()):
            print "Attribute:", streamReader.getAttributeLocalName(i), "=", streamReader.getAttributeValue(i)
    elif event == XMLStreamReader.END_ELEMENT:
        print "End Element:", streamReader.getLocalName()
    elif event == XMLStreamReader.CHARACTERS:
        print "Character Data:", streamReader.getText()

# Release the reader's resources
streamReader.close()