Transform Nested XML-CodePudding

I am currently looking to parse out a nested XML into a pandas Datatable so I can generate a CSV with each column being an element name and the value of that being the element text but I am having some issues parsing the information out. Below is an example of the nested XML and what I have tried.

The below XML can be quite large with hundreds of different records. This is what I tried:

##Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree

tree = ET.parse("File.xml")
root = tree.getroot()

for subelement in root:
    for subsub in subelement:
        print(subsub.tag,",", subsub.text, subsub.attrib, subsub.items())

for subelement in root:
    for subsub in subelement:
        for subsubsub in subsub:
            print(subsubsub.tag,",", subsubsub.text, subsubsub.attrib)

<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
    <test2 ID="123123123" test3="123123">
        <test3>Separate</test3>
        <test4>AA</test4>
        <Comments>BB</Comments>
        <test5>
            <test6 ID="123123">
                <test3>today</test3>
                <test7>123 street</test7>
            </test6>
        </test5>
        <test8>
            <test10 ID="434234">
                <test3>type of work</test3>
                <test9>test work</test9>
            </test10>
        </test8>
        <test11>
            <test12 ID="234234234">
                <test3>Social</test3>
                <test14>test</test14>
            </test12>
            <test12 ID="123123">
                <test3>Something Here</test3>
                <test13>Some date</test13>
                <test14>123123124433</test14>
            </test12>
        </test11>
        <test15>
            <test16 ID="6456456456">
                <test3>Something Something</test3>
                <test14>746745636</test14>
            </test16>
        </test15>
    </test2>
    <test2 ID="353453245" test3="list of something">
        <test3>Somewhere</test3>
        <test4>Someone</test4>
        <Comments>Some comment</Comments>
        <test5>
            <test6 ID="567456756">
                <test3>Not today</test3>
                <test7>5634643643</test7>
                <test17>Some Info</test17>
                <test19>Somewhere</test19>
                <test18>63243333</test18>
            </test6>
        </test5>
        <test11>
            <test12 ID="456436346">
                <test3>Pattern</test3>
                <test14>436346346</test14>
            </test12>
            <test12 ID="4364356">
                <test3> ID</test3>
                <test14>5674567457</test14>
            </test12>
            <test12 ID="123123123443">
                <test3>Other ID</test3>
                <test13>54234532452345</test13>
                <test14>231423532452345</test14>
            </test12>
        </test11>
        <test15>
            <test16 ID="34252345">
                <test3>None test</test3>
                <test14>456436436346</test14>
            </test16>
        </test15>
    </test2>
</test1>

Update So would the full code look something like this?

###TEST USING EXAMPLE HOTLIST
with open("file.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('file.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence of the tag
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence of the tag
                row[tag] = [old, text]
            else:
                # already a list
                old.append(text)

CodePudding user response：

For nested XML you can use iterparse() function to iterate over all elements in the XML. You would then need to have logic to handle the elements depending on what tag it's looking at to add to a dictionary object to export as a row.

for _, elem in ET.iterparse('file.xml'):
    if len(elem) == 0:
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')

To create a row in a CSV file from the element text then can do something like this. If, for example, the "test2" marks the beginning of a new record then that can be used to write the record to a new row and clear the dictionary for the next record.

If want to output all or some attributes then need to add a few lines of code for that. If attribute names have the same name as element name or multiple elements have same attribute (e.g. ID) then need to address that in your code.

import xml.etree.ElementTree as ET
import re
import csv

with open("out.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('test.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            row[tag] = elem.text

Output:

{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}

CSV Output:

test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment

Update:

If want to handle duplicate tags and create a list of values then try something like this:

if len(elem) == 0:
    text = elem.text
    old = row.get(tag)
    if old is None:
        # first occurrence
        row[tag] = text
    elif isinstance(old, str):
        # second occurrence > create list
        row[tag] = [old, text]
    else:
        old.append(text)