XML to CSV With XPath: Extracting Specific Fields From Complex Documents

Why XML to CSV Is Non-Trivial

XML stores hierarchical data with arbitrary nesting:

<library>
  <book id="1">
    <title>Book Title</title>
    <authors>
      <author>Alice</author>
      <author>Bob</author>
    </authors>
    <year>2024</year>
  </book>
</library>

CSV is flat:

id,title,authors,year
1,Book Title,"Alice, Bob",2024

The conversion needs to flatten the hierarchy: combine multiple authors, drop nesting markers, decide which fields go to columns.

XPath is the standard query language for XML. It lets you select specific elements declaratively instead of writing tree-walking code by hand; the document still has to be parsed first. This post covers the practical workflows.

For broader CSV processing, see Batch Text Replacement in CSV.

XPath Basics

XPath syntax for selecting elements:

Expression	Meaning
`/library/book`	Absolute path: book children of the root library element
`//book`	Any book element at any depth
`book[@id="1"]`	Attribute filter
`book[year > 2020]`	Numeric filter
`book/authors/author`	Nested traversal
`book[1]`	Position (first)
`book[last()]`	Last element
`name()`	Element name
`text()`	Text content
`@id`	Attribute value

For most extractions: simple paths plus filters.
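To make the table concrete, here is a minimal sketch that runs several of these expressions against the sample library. It uses Python's standard-library ElementTree rather than lxml, which is a swap made here so the example has no dependencies. ElementTree supports only a subset of XPath, so the numeric filter is done in Python. The second book is made-up sample data.

```python
import xml.etree.ElementTree as ET

# Sample from the introduction plus a second, made-up book.
XML = """
<library>
  <book id="1">
    <title>Book Title</title>
    <authors><author>Alice</author><author>Bob</author></authors>
    <year>2024</year>
  </book>
  <book id="2">
    <title>Older Book</title>
    <authors><author>Carol</author></authors>
    <year>2019</year>
  </book>
</library>
"""

root = ET.fromstring(XML)  # root is the <library> element

# ElementTree paths are relative to the element they are called on.
direct_children = root.findall("book")          # like /library/book
any_descendant = root.findall(".//author")      # like //author
by_attribute = root.findall("book[@id='1']")    # attribute filter
last_book = root.find("book[last()]")           # last <book> child
nested = root.findall("book/authors/author")    # nested traversal

# ElementTree has no numeric predicates such as book[year > 2020],
# so the comparison is done in Python instead.
recent = [b for b in direct_children if int(b.findtext("year")) > 2020]

print([b.get("id") for b in direct_children])
print([a.text for a in any_descendant])
print(last_book.findtext("title"))
print([b.get("id") for b in recent])
```

With lxml, the same expressions (including the numeric filter) can be passed to xpath() directly.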

xmlstarlet (CLI)

xmlstarlet is the command-line XML toolkit:

# Install
brew install xmlstarlet  # Mac
sudo apt install xmlstarlet  # Linux

# Extract values
xmlstarlet sel -t -m "//book" \
  -v "@id" -o "," \
  -v "title" -o "," \
  -v "year" -n \
  input.xml > output.csv

Parameters:

-t: start a template
-m "//book": match each node selected by the XPath expression
-v "@id": print the value of an XPath expression (here, the id attribute)
-o ",": output literal text
-n: output a newline

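For one-off extraction: xmlstarlet is fast and scriptable. The template above writes values verbatim, with no header row and no CSV quoting, so a title that contains a comma shifts the columns. Switch to Python's csv module when values may contain commas, quotes, or line breaks.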

Python (lxml)

For programmatic extraction:

from lxml import etree
import csv

tree = etree.parse("input.xml")

# newline="" lets the csv module control line endings
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "authors", "year"])

    for book in tree.xpath("//book"):
        book_id = book.get("id")
        # findtext returns None instead of raising when the element is missing
        title = book.findtext("title")
        authors = ", ".join(a.text or "" for a in book.findall("authors/author"))
        year = book.findtext("year")
        writer.writerow([book_id, title, authors, year])

For complex transformations: Python is more flexible than xmlstarlet.

Pandas Approach

For analytical workflows:

import pandas as pd
from lxml import etree

tree = etree.parse("input.xml")

records = []
for book in tree.xpath("//book"):
    records.append({
        "id": book.get("id"),
        "title": book.findtext("title"),
        # a.text is None for an empty <author/>, so fall back to ""
        "authors": ", ".join(a.text or "" for a in book.findall("authors/author")),
        "year": book.findtext("year")
    })

df = pd.DataFrame(records)
df.to_csv("output.csv", index=False)

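Pandas handles type conversion, sorting, and filtering after extraction. The extracted values arrive as strings, so convert numeric columns (for example with pd.to_numeric) before sorting or comparing them.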

PowerShell

For Windows environments:

[xml]$xml = Get-Content "input.xml"
$xml.SelectNodes("//book") | ForEach-Object {
    [PSCustomObject]@{
        id = $_.id
        title = $_.title
        authors = ($_.authors.author -join ", ")
        year = $_.year
    }
} | Export-Csv "output.csv" -NoTypeInformation

PowerShell's XML handling is verbose but Windows-native.

For batch processing patterns, see Batch Processing Files Guide.

Handling Namespaces

XML with namespaces requires explicit handling:

<root xmlns:book="http://example.com/book">
  <book:item>
    <book:title>Title</book:title>
  </book:item>
</root>

Python with lxml:

ns = {"book": "http://example.com/book"}
for item in tree.xpath("//book:item", namespaces=ns):
    title = item.find("book:title", namespaces=ns).text

xmlstarlet:

xmlstarlet sel -N book="http://example.com/book" \
  -t -m "//book:item" \
  -v "book:title" -n \
  input.xml

Namespaces are critical for documents that use them; ignoring them produces empty results.
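The "empty results" failure is easy to reproduce. The sketch below uses the standard-library ElementTree (a swap for lxml, chosen only to keep the example dependency-free) on the sample document above. It also shows that the prefix is just a local alias: only the URI has to match.

```python
import xml.etree.ElementTree as ET

XML = """
<root xmlns:book="http://example.com/book">
  <book:item>
    <book:title>Title</book:title>
  </book:item>
</root>
"""

root = ET.fromstring(XML)
ns = {"book": "http://example.com/book"}

# Without the namespace, the path silently matches nothing.
unqualified = root.findall(".//item")

# With the prefix mapped to the URI, the element is found.
qualified = root.findall(".//book:item", namespaces=ns)
titles = [item.findtext("book:title", namespaces=ns) for item in qualified]

# A different prefix works as long as it maps to the same URI.
alias = {"b": "http://example.com/book"}
same_titles = [t.text for t in root.findall(".//b:title", namespaces=alias)]

print(len(unqualified), titles, same_titles)
```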

Multi-row Extraction

For XML where one source element produces multiple CSV rows:

<order>
  <id>1001</id>
  <items>
    <item><sku>A</sku><qty>5</qty></item>
    <item><sku>B</sku><qty>3</qty></item>
  </items>
</order>

To produce one row per item:

records = []
for order in tree.xpath("//order"):
    order_id = order.findtext("id")
    for item in order.findall("items/item"):
        records.append({
            "order_id": order_id,
            "sku": item.findtext("sku"),
            "qty": int(item.findtext("qty"))
        })

pd.DataFrame(records).to_csv("output.csv", index=False)

Result:

order_id,sku,qty
1001,A,5
1001,B,3
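One consequence of the nested loop is easy to miss: an order with no items produces no rows at all. The sketch below shows this with the standard library only (ElementTree and csv in place of lxml and pandas). The orders wrapper element and order 1002 are made-up sample data, and order_rows is a hypothetical helper name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# <orders> wrapper and the empty order 1002 are illustrative additions.
XML = """
<orders>
  <order>
    <id>1001</id>
    <items>
      <item><sku>A</sku><qty>5</qty></item>
      <item><sku>B</sku><qty>3</qty></item>
    </items>
  </order>
  <order>
    <id>1002</id>
    <items/>
  </order>
</orders>
"""

def order_rows(root):
    """Yield one (order_id, sku, qty) tuple per <item>."""
    for order in root.iter("order"):
        order_id = order.findtext("id")
        for item in order.findall("items/item"):
            yield order_id, item.findtext("sku"), int(item.findtext("qty"))

root = ET.fromstring(XML)
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["order_id", "sku", "qty"])
writer.writerows(order_rows(root))
csv_text = buffer.getvalue()
print(csv_text)
```

If empty orders must still appear in the output, one option is to emit a single row with blank sku and qty when the inner loop finds nothing.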

Streaming Large XML

For multi-GB XML files that don't fit in memory:

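from lxml import etree

context = etree.iterparse("large.xml", events=("end",), tag="book")

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    for event, book in context:
        # Process and write each book
        # ...
        book.clear()  # Free the element's own children and text
        # Drop already-processed siblings still referenced by the parent
        while book.getprevious() is not None:
            del book.getparent()[0]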

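iterparse keeps memory use roughly flat as long as processed elements are released. book.clear() empties each element, but the emptied elements stay attached to their parent, so on very large files also delete processed siblings from the parent (for example, del book.getparent()[0] while book.getprevious() is not None).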
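For environments without lxml, a similar streaming loop can be sketched with the standard-library ElementTree. Its iterparse has no tag argument, so the tag is checked by hand, and memory is released by clearing the root after each book. The function name stream_books_to_csv and the chosen columns are assumptions for illustration; a small in-memory document stands in for the large file.

```python
import csv
import io
import xml.etree.ElementTree as ET

def stream_books_to_csv(source, out):
    """Write one CSV row per <book> without keeping the whole tree."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "title", "year"])
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "book":
            writer.writerow(
                [elem.get("id"), elem.findtext("title"), elem.findtext("year")]
            )
            count += 1
            root.clear()  # detach finished <book> elements from the root
    return count

# Small in-memory stand-in for a large file on disk.
sample = io.BytesIO(
    b"<library>"
    b"<book id='1'><title>A</title><year>2024</year></book>"
    b"<book id='2'><title>B</title><year>2019</year></book>"
    b"</library>"
)
out = io.StringIO()
n = stream_books_to_csv(sample, out)
print(n)
print(out.getvalue())
```

For a real file, pass a path or a binary file object as source and an open text file (created with newline="") as out.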

For broader large-file handling, see Batch Text Replacement in CSV.

XPath for Complex Filters

For specific data extraction:

# Books published after 2020
xmlstarlet sel -t -m "//book[year > 2020]" \
  -v "title" -n input.xml

# Books by specific author
xmlstarlet sel -t -m "//book[authors/author='Alice']" \
  -v "title" -n input.xml

# Books with multiple authors
xmlstarlet sel -t -m "//book[count(authors/author) > 1]" \
  -v "title" -n input.xml

XPath's filtering is powerful for complex queries.

Common Issues

Empty cells in CSV output: the XPath expression doesn't match anything. Run xmlstarlet el input.xml to list the document's element paths and check the expression against them.

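Special characters mangled: encoding mismatch between the writer and whatever reads the file. Set the encoding explicitly (pandas already defaults to UTF-8; use "utf-8-sig" instead if the file will be opened in Excel, which relies on the byte-order mark to detect UTF-8):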

df.to_csv("output.csv", encoding="utf-8")

Multi-line text breaks rows: a field that contains a line break must be quoted. Pandas and Python's csv module quote automatically; hand-built output (string joins, xmlstarlet templates) does not.

Slow or memory-hungry on large XML: parse builds the whole tree in memory, which becomes the bottleneck once the file approaches the size of available RAM. Streaming with iterparse keeps memory bounded.

Namespaced elements not found: declare the namespace and use its prefix on every step of the XPath expression.
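The quoting issue in the list above can be checked in a few lines. This is a minimal sketch with made-up rows: the csv module quotes fields that contain commas, quotes, or line breaks, so the data survives a round trip, while a naive string join produces an extra physical line and misaligned columns.

```python
import csv
import io

# Made-up rows containing a comma, embedded quotes, and a line break.
rows = [
    ["id", "title", "notes"],
    ["1", "Title, With Comma", 'She said "hi"'],
    ["2", "Plain", "line one\nline two"],
]

# csv.writer quotes only the fields that need it.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()

# Naive joining: no quoting, so structure is lost.
naive_text = "\n".join(",".join(r) for r in rows)

# Reading the properly quoted text back recovers the original rows.
round_trip = list(csv.reader(io.StringIO(csv_text)))

print(csv_text)
print(round_trip == rows)
print(len(naive_text.splitlines()))
```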

Reverse: CSV to XML

For CSV to XML:

import pandas as pd
from lxml import etree

df = pd.read_csv("input.csv")
root = etree.Element("library")

for _, row in df.iterrows():
    book = etree.SubElement(root, "book", id=str(row["id"]))
    title = etree.SubElement(book, "title")
    title.text = str(row["title"])  # str() guards against non-string cells
    year = etree.SubElement(book, "year")
    year.text = str(row["year"])

tree = etree.ElementTree(root)
tree.write("output.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8")

CSV to XML is straightforward when the structure is flat.
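The less straightforward part is a column that was flattened from a list, such as authors. The sketch below rebuilds the nested authors element by splitting on commas, using only the standard library (csv and ElementTree instead of pandas and lxml). It assumes no author name itself contains a comma; csv_to_library is a hypothetical helper name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The flat CSV from the introduction.
CSV_TEXT = 'id,title,authors,year\n1,Book Title,"Alice, Bob",2024\n'

def csv_to_library(text):
    """Rebuild <library> from flat CSV, re-nesting the authors column."""
    root = ET.Element("library")
    for row in csv.DictReader(io.StringIO(text)):
        book = ET.SubElement(root, "book", id=row["id"])
        ET.SubElement(book, "title").text = row["title"]
        authors = ET.SubElement(book, "authors")
        # Assumption: names were joined with commas and contain none themselves.
        for name in row["authors"].split(","):
            if name.strip():
                ET.SubElement(authors, "author").text = name.strip()
        ET.SubElement(book, "year").text = row["year"]
    return root

root = csv_to_library(CSV_TEXT)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

If names can contain commas, a different separator (for example a semicolon) in the CSV column avoids the ambiguity.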

Frequently Asked Questions

Should I use xmlstarlet or Python?

For one-off extractions: xmlstarlet (a single command, no script to write). For complex pipelines: Python with lxml.

What about XSLT?

XSLT is the standard XML transformation language. Powerful but verbose. For complex transformations: XSLT shines. For simple extractions: XPath via xmlstarlet/Python.

Can I convert XML to JSON?

Yes:

import xmltodict
import json

with open("input.xml", "rb") as f:  # binary mode so the XML declaration's encoding is honored
    data = xmltodict.parse(f.read())

with open("output.json", "w") as f:
    json.dump(data, f, indent=2)

For JSON-to-CSV, see JSON to CSV With Nested Fields.

How do I handle attributes vs elements?

In XPath, @name selects an attribute and a bare name selects a child element. In lxml, read an attribute with element.get("id") and an element's text with element.findtext("title").

What about XML schema (XSD)?

XSD describes valid XML structure. For one-off extraction: not needed. For ongoing pipelines: validate input against it before extraction, so a structural change fails loudly instead of producing empty cells.

Performance comparison?

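Rough figures for a 100 MB XML file: xmlstarlet ~5 seconds, Python with lxml ~3 seconds, Python with iterparse (streaming) ~3 seconds, PowerShell 30-60 seconds. Treat these as ballpark numbers that depend on hardware and document shape. Run time grows with file size for every tool; what streaming keeps roughly constant is memory, not time.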
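Since timings vary by machine and document shape, it is worth measuring on your own data. The following is a small timing harness, sketched with the standard-library ElementTree and synthetic data; the function names and the 20,000-book sample size are arbitrary choices for illustration, and no particular result is implied.

```python
import io
import time
import xml.etree.ElementTree as ET

def make_sample(n_books):
    """Build a synthetic <library> document as bytes."""
    books = "".join(
        f"<book id='{i}'><title>T{i}</title><year>2024</year></book>"
        for i in range(n_books)
    )
    return f"<library>{books}</library>".encode("utf-8")

def count_with_parse(data):
    """Build the full tree, then count <book> children."""
    return len(ET.parse(io.BytesIO(data)).getroot().findall("book"))

def count_with_iterparse(data):
    """Stream the document, counting and emptying each <book>."""
    count = 0
    for _, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "book":
            count += 1
            elem.clear()  # empties the element; its shell stays on the root
    return count

def timed(func, data):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start

data = make_sample(20_000)
results = {f.__name__: timed(f, data) for f in (count_with_parse, count_with_iterparse)}
for name, (count, seconds) in results.items():
    print(f"{name}: {count} books in {seconds:.3f}s")
```

Swap in a real file and your actual extraction logic to get numbers that mean something for your pipeline.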

Bottom Line

For XML to CSV extraction in 2026: xmlstarlet for command-line, Python lxml for programmatic, PowerShell for Windows environments. Use XPath for selecting specific elements. Handle namespaces explicitly. For large files: streaming with iterparse. Our document converter handles related document conversions.