Why XML to CSV Is Non-Trivial
XML stores hierarchical data with arbitrary nesting:
<library>
  <book id="1">
    <title>Book Title</title>
    <authors>
      <author>Alice</author>
      <author>Bob</author>
    </authors>
    <year>2024</year>
  </book>
</library>
CSV is flat:
id,title,authors,year
1,Book Title,"Alice, Bob",2024
The conversion needs to flatten the hierarchy: combine multiple authors, drop nesting markers, decide which fields go to columns.
XPath is the standard query language for XML. With XPath, you select specific elements declaratively instead of writing tree-walking code by hand; the document still has to be parsed first. This post covers the practical workflows.
For broader CSV processing, see Batch Text Replacement in CSV.
XPath Basics
XPath syntax for selecting elements:
| Expression | Meaning |
|---|---|
| /library/book | Direct children |
| //book | Any descendant |
| book[@id="1"] | Attribute filter |
| book[year > 2020] | Numeric filter |
| book/authors/author | Nested traversal |
| book[1] | Position (first) |
| book[last()] | Last element |
| name() | Element name |
| text() | Text content |
| @id | Attribute value |
For most extractions: simple paths plus filters.
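To try a few of these expressions without installing anything, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. Note the assumption: ElementTree implements only a subset of XPath (no numeric comparisons, no `count()`), so it stands in here for a full XPath 1.0 engine such as lxml. The second `<book>` in the sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data: the first book is from the example above, the second is hypothetical.
SAMPLE = """<library>
  <book id="1"><title>Book Title</title>
    <authors><author>Alice</author><author>Bob</author></authors>
    <year>2024</year></book>
  <book id="2"><title>Older Book</title>
    <authors><author>Carol</author></authors>
    <year>2019</year></book>
</library>"""

root = ET.fromstring(SAMPLE)  # root is the <library> element

all_books = root.findall(".//book")        # any descendant, like //book
book_one = root.find("book[@id='1']")      # attribute filter
first_book = root.find("book[1]")          # position (first)
last_book = root.find("book[last()]")      # last element
authors = [a.text for a in book_one.findall("authors/author")]  # nested traversal
book_id = book_one.get("id")               # attribute value, like @id
```

Paths here are relative to `<library>`, which is why they start with `book` rather than `/library/book`.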
xmlstarlet (CLI)
xmlstarlet is the command-line XML toolkit:
# Install
brew install xmlstarlet        # macOS (Homebrew)
sudo apt install xmlstarlet    # Debian/Ubuntu

# Extract values
xmlstarlet sel -t -m "//book" \
  -v "@id" -o "," \
  -v "title" -o "," \
  -v "year" -n \
  input.xml > output.csv
Parameters:
-t: template
-m "//book": match expression
-v "@id": get attribute value
-o ",": output literal text
-n: newline
For one-off extraction: xmlstarlet is fast and scriptable.
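One caveat with a template like this: the values are concatenated as-is, so a title that itself contains a comma shifts every column after it. The sketch below illustrates the difference between plain joining and proper CSV quoting, using Python's `csv` module; the sample title is hypothetical.

```python
import csv
import io

row = ["1", "Rust, Go, and Friends", "2024"]  # hypothetical title containing commas

# Plain concatenation, equivalent to emitting values separated by a literal ",".
naive = ",".join(row)

# csv.writer quotes any field that contains the delimiter.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
quoted = buf.getvalue().rstrip("\n")

# Reading both lines back shows the damage: the naive line no longer has 3 fields.
naive_fields = next(csv.reader([naive]))
quoted_fields = next(csv.reader([quoted]))
```

If the data may contain commas, quotes, or newlines, it is safer to produce the CSV with a tool that quotes fields than to post-process concatenated output.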
Python (lxml)
For programmatic extraction:
from lxml import etree
import csv

tree = etree.parse("input.xml")

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "authors", "year"])
    for book in tree.xpath("//book"):
        book_id = book.get("id")
        title = book.find("title").text
        authors = ", ".join(a.text for a in book.findall("authors/author"))
        year = book.find("year").text
        writer.writerow([book_id, title, authors, year])
For complex transformations: Python is more flexible than xmlstarlet.
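Real documents often have records with missing elements, and `find(...).text` raises an `AttributeError` when `find` returns `None`. The following is a sketch of a more defensive row builder, assuming empty strings are an acceptable placeholder. It uses the standard-library ElementTree so it runs without lxml; `findtext` and `get` behave the same way on lxml elements. The helper name `book_to_row` and the second sample book are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE = """<library>
  <book id="1"><title>Book Title</title><year>2024</year>
    <authors><author>Alice</author><author>Bob</author></authors></book>
  <book id="2"><title>No Year Or Authors</title></book>
</library>"""

def book_to_row(book):
    """Flatten one <book> element into [id, title, authors, year]."""
    authors = ", ".join(a.text or "" for a in book.findall("authors/author"))
    return [
        book.get("id", ""),
        book.findtext("title", default=""),  # "" instead of an exception
        authors,
        book.findtext("year", default=""),
    ]

root = ET.fromstring(SAMPLE)
rows = [book_to_row(b) for b in root.iter("book")]

# Write to an in-memory buffer here; a real script would open a file instead.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["id", "title", "authors", "year"])
writer.writerows(rows)
csv_text = out.getvalue()
```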
Pandas Approach
For analytical workflows:
import pandas as pd
from lxml import etree

tree = etree.parse("input.xml")

records = []
for book in tree.xpath("//book"):
    records.append({
        "id": book.get("id"),
        "title": book.findtext("title"),
        "authors": ", ".join(a.text for a in book.findall("authors/author")),
        "year": book.findtext("year")
    })

df = pd.DataFrame(records)
df.to_csv("output.csv", index=False)
Pandas handles type conversion, sorting, filtering after extraction.
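Everything extracted from XML arrives as a string, so the conversion step has to be explicit. Here is a small sketch of what that post-processing might look like, assuming records shaped like the ones built above (the second record is made up for illustration):

```python
import pandas as pd

# Records as they would come out of the extraction loop: all values are strings.
records = [
    {"id": "2", "title": "Older Book", "authors": "Carol", "year": "2019"},
    {"id": "1", "title": "Book Title", "authors": "Alice, Bob", "year": "2024"},
]
df = pd.DataFrame(records)

# Convert before comparing or sorting; errors="coerce" turns bad values into NaN.
df["id"] = pd.to_numeric(df["id"])
df["year"] = pd.to_numeric(df["year"], errors="coerce")

# Filter and sort on the typed columns, then serialize.
recent = df[df["year"] > 2020].sort_values("id")
csv_text = recent.to_csv(index=False)
```

Without the conversion, `df["year"] > 2020` would compare strings with an integer and fail.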
PowerShell
For Windows environments:
# -Raw reads the file as one string; -Encoding UTF8 keeps non-ASCII text intact
[xml]$xml = Get-Content "input.xml" -Raw
$xml.SelectNodes("//book") | ForEach-Object {
    [PSCustomObject]@{
        id      = $_.id
        title   = $_.title
        authors = ($_.authors.author -join ", ")
        year    = $_.year
    }
} | Export-Csv "output.csv" -NoTypeInformation -Encoding UTF8
PowerShell's XML handling is verbose but Windows-native.
For batch processing patterns, see Batch Processing Files Guide.
Handling Namespaces
XML with namespaces requires explicit handling:
<root xmlns:book="http://example.com/book">
  <book:item>
    <book:title>Title</book:title>
  </book:item>
</root>
Python with lxml:
ns = {"book": "http://example.com/book"}
for item in tree.xpath("//book:item", namespaces=ns):
    title = item.find("book:title", namespaces=ns).text
xmlstarlet:
xmlstarlet sel -N book="http://example.com/book" \
  -t -m "//book:item" \
  -v "book:title" -n \
  input.xml
Namespaces are critical for documents that use them; ignoring them produces empty results.
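A common variant of this problem is a default namespace (`xmlns="..."` with no prefix), where the elements look unprefixed in the file but still live in a namespace. The sketch below, using the standard-library ElementTree and a hypothetical prefix `b`, shows that an unqualified path matches nothing while a prefixed one works. The prefix is arbitrary; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Same URI as the example above, but declared as the default namespace.
SAMPLE = """<root xmlns="http://example.com/book">
  <item><title>Title</title></item>
</root>"""

root = ET.fromstring(SAMPLE)
ns = {"b": "http://example.com/book"}  # "b" is our own choice of prefix

unqualified = root.findall(".//item")      # no namespace given: no match
qualified = root.findall(".//b:item", ns)  # prefix bound to the URI: match
titles = [item.findtext("b:title", namespaces=ns) for item in qualified]
```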
Multi-row Extraction
For XML where one source produces multiple CSV rows:
<order>
  <id>1001</id>
  <items>
    <item><sku>A</sku><qty>5</qty></item>
    <item><sku>B</sku><qty>3</qty></item>
  </items>
</order>
To produce one row per item:
records = []
for order in tree.xpath("//order"):
    order_id = order.findtext("id")
    for item in order.findall("items/item"):
        records.append({
            "order_id": order_id,
            "sku": item.findtext("sku"),
            "qty": int(item.findtext("qty"))
        })

pd.DataFrame(records).to_csv("output.csv", index=False)
Result:
order_id,sku,qty
1001,A,5
1001,B,3
Streaming Large XML
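For multi-GB XML files that don't fit in memory:
from lxml import etree
import csv

context = etree.iterparse("large.xml", events=("end",), tag="book")
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "year"])
    for event, book in context:
        writer.writerow([book.get("id"), book.findtext("title"), book.findtext("year")])
        book.clear()  # Free the element's children and text
        # Drop already-processed siblings so the root does not keep growing
        while book.getprevious() is not None:
            del book.getparent()[0]
iterparse keeps memory use roughly flat as long as processed elements are released. book.clear() empties each element, and deleting earlier siblings from the parent stops the root from accumulating empty stubs.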
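The same streaming idea can be sketched with the standard library alone. Note the assumptions: `xml.etree.ElementTree.iterparse` has no `tag=` filter, so the tag check is done by hand, and this version assumes the `<book>` elements are direct children of the root. The function name `stream_books` and the sample data are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

def stream_books(source, out):
    """Write one CSV row per <book> without building the full tree."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "title", "year"])
    count = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event is the document root
        if event == "end" and elem.tag == "book":
            writer.writerow([elem.get("id"), elem.findtext("title"), elem.findtext("year")])
            count += 1
            root.clear()  # detach processed children so memory stays flat
    return count

# Small in-memory stand-in for a large file; a path or open file works the same way.
xml_bytes = (b'<library><book id="1"><title>A</title><year>2024</year></book>'
             b'<book id="2"><title>B</title><year>2019</year></book></library>')
out = io.StringIO()
n = stream_books(io.BytesIO(xml_bytes), out)
```

Rows are written as soon as each `<book>` closes, so nothing accumulates in a list either.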
For broader large-file handling, see Batch Text Replacement in CSV.
XPath for Complex Filters
For specific data extraction:
# Books published after 2020
xmlstarlet sel -t -m "//book[year > 2020]" \
  -v "title" -n input.xml

# Books by specific author
xmlstarlet sel -t -m "//book[authors/author='Alice']" \
  -v "title" -n input.xml

# Books with multiple authors
xmlstarlet sel -t -m "//book[count(authors/author) > 1]" \
  -v "title" -n input.xml
XPath's filtering is powerful for complex queries.
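When a full XPath engine is not available, or the condition gets too involved for a predicate, the same three filters can be expressed in plain Python after a simple selection. This is a sketch using the standard-library ElementTree, which does not support numeric comparisons or `count()` in predicates. The helper `authors_of` and the second sample book are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<library>
  <book id="1"><title>Book Title</title>
    <authors><author>Alice</author><author>Bob</author></authors>
    <year>2024</year></book>
  <book id="2"><title>Older Book</title>
    <authors><author>Carol</author></authors>
    <year>2019</year></book>
</library>"""

root = ET.fromstring(SAMPLE)
books = root.findall(".//book")

def authors_of(book):
    """Return the list of author names for one <book>."""
    return [a.text for a in book.findall("authors/author")]

# //book[year > 2020]
after_2020 = [b.findtext("title") for b in books if int(b.findtext("year")) > 2020]
# //book[authors/author='Alice']
by_alice = [b.findtext("title") for b in books if "Alice" in authors_of(b)]
# //book[count(authors/author) > 1]
multi_author = [b.findtext("title") for b in books if len(authors_of(b)) > 1]
```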
Common Issues
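Empty cells in CSV output: the XPath doesn't match anything. Run xmlstarlet el input.xml to print the document's element paths and check yours against them.
Special characters mangled: encoding mismatch between the writer and whatever reads the file. Write UTF-8 explicitly (pandas already defaults to it); if the file will be opened in Excel, encoding="utf-8-sig" adds the byte-order mark Excel uses to detect UTF-8:
df.to_csv("output.csv", encoding="utf-8")
Multi-line text loses line breaks: fields containing newlines must be quoted. Pandas and Python's csv module quote them automatically; hand-built output needs explicit quoting.
Slow or memory-hungry on large XML: parse() builds the whole tree in memory, while streaming with iterparse keeps memory bounded on large files.
Namespaces ignored: declare the prefix-to-URI mapping and use the prefix in every step of the XPath.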
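Two of these issues are easy to check in isolation. The sketch below, using only the standard library and made-up sample values, confirms that a quoted multi-line field survives a write/read round trip and shows the byte-order mark that the `utf-8-sig` codec prepends:

```python
import csv
import io

row = ["1", "Line one\nLine two", "Café"]  # hypothetical multi-line, non-ASCII data

# csv.writer quotes the field that contains a newline.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
text = buf.getvalue()

# Reading it back yields the original three fields, line break included.
parsed = next(csv.reader(io.StringIO(text)))

# utf-8-sig is plain UTF-8 with a 3-byte BOM in front.
plain = "Café".encode("utf-8")
with_bom = "Café".encode("utf-8-sig")
```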
Reverse: CSV to XML
For CSV to XML:
import pandas as pd
from lxml import etree

df = pd.read_csv("input.csv")

root = etree.Element("library")
for _, row in df.iterrows():
    book = etree.SubElement(root, "book", id=str(row["id"]))
    title = etree.SubElement(book, "title")
    title.text = str(row["title"])  # .text must be a string
    year = etree.SubElement(book, "year")
    year.text = str(row["year"])

tree = etree.ElementTree(root)
tree.write("output.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8")
CSV to XML is straightforward when the structure is flat.
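When a column holds a flattened list, such as the `authors` column from the first example, the reverse conversion has to undo the flattening. This is a sketch using only the standard library. It assumes authors were joined with `", "` and that no author name contains that separator, which will not hold for all data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for input.csv, matching the CSV shown at the top of the post.
CSV_TEXT = 'id,title,authors,year\n1,Book Title,"Alice, Bob",2024\n'

root = ET.Element("library")
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    book = ET.SubElement(root, "book", id=row["id"])
    ET.SubElement(book, "title").text = row["title"]
    authors = ET.SubElement(book, "authors")
    # Assumption: ", " is the separator used when the authors were flattened.
    for name in row["authors"].split(", "):
        ET.SubElement(authors, "author").text = name
    ET.SubElement(book, "year").text = row["year"]

xml_text = ET.tostring(root, encoding="unicode")
```

The `csv` module returns every value as a string, so no `str()` conversion is needed before assigning to `.text`.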
Frequently Asked Questions
Should I use xmlstarlet or Python?
For one-off scripts: xmlstarlet (nothing to set up beyond installing it). For complex pipelines: Python with lxml.
What about XSLT?
XSLT is the standard XML transformation language. Powerful but verbose. For complex transformations: XSLT shines. For simple extractions: XPath via xmlstarlet/Python.
Can I convert XML to JSON?
Yes:
import xmltodict
import json

with open("input.xml") as f:
    data = xmltodict.parse(f.read())

with open("output.json", "w") as f:
    json.dump(data, f, indent=2)
For JSON-to-CSV, see JSON to CSV With Nested Fields.
How do I handle attributes vs elements?
XPath: @attribute for attribute, element name for element. Both extract similarly.
What about XML schema (XSD)?
XSD describes valid XML structure. For one-off extraction: not needed. For ongoing pipelines: validates input.
Performance comparison?
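Rough figures for a 100 MB file: xmlstarlet ~5 seconds, Python with lxml ~3 seconds, PowerShell 30-60 seconds. Streaming with iterparse runs at about the same speed as a full lxml parse (~3 seconds per 100 MB); its run time still grows with file size, but its memory use does not. Treat these as ballpark numbers and measure on your own data.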
Bottom Line
For XML to CSV extraction in 2026: xmlstarlet for command-line, Python lxml for programmatic, PowerShell for Windows environments. Use XPath for selecting specific elements. Handle namespaces explicitly. For large files: streaming with iterparse. Our document converter handles related document conversions.



