XML ElementTree: one element at a time in memory

I have a fairly large XML file and relatively little memory. I currently load the entire file into memory while parsing it, as shown in the snippet below, which slows the whole computer down and sometimes fails outright. Is there a way to load only one item into memory at a time? Perhaps with some kind of multiprocessing, as in deep learning, where the next file is loaded while the current one is being processed.

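import xml.etree.ElementTree as ET

# Builds the complete tree in memory before the loop starts
root = ET.parse("my_file.xml").getroot()

for child in root:
    do_something(child)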

CodePudding user response：

Use the iterparse() function, which parses XML incrementally. It suits large files because the whole document does not have to be loaded into memory before you start processing it.

From the documentation:

xml.etree.ElementTree.iterparse(source, events=None, parser=None)

Parses an XML section into an element tree incrementally, and reports what’s going on to the user. source is a filename or file object containing XML data. events is a sequence of events to report back. The supported events are the strings "start", "end", "comment", "pi", "start-ns" and "end-ns" (the “ns” events are used to get detailed namespace information). If events is omitted, only "end" events are reported. parser is an optional parser instance. If not given, the standard XMLParser parser is used. parser must be a subclass of XMLParser and can only use the default TreeBuilder as a target. Returns an iterator providing (event, elem) pairs.
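To make the events parameter concrete, here is a minimal sketch that requests both "start" and "end" events and records the order in which they arrive. The sample document and the helper name collect_events are made up for illustration. Note that at a "start" event only the tag and attributes are guaranteed to be available; text and children are complete only at the matching "end" event.

```python
import io
from xml.etree import ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = b"<root><item id='1'>a</item><item id='2'>b</item></root>"


def collect_events(source):
    """Return (event, tag) pairs in the order iterparse reports them."""
    return [
        (event, elem.tag)
        for event, elem in ET.iterparse(source, events=("start", "end"))
    ]


events = collect_events(io.BytesIO(SAMPLE))
for event, tag in events:
    print(event, tag)
```

If events is left out, only the "end" pairs would be reported, which is usually what you want when processing complete elements.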

It's a blocking function, so it is not asynchronous. That matters if you run several XML readers at the same time:

Note that while iterparse() builds the tree incrementally, it issues blocking reads on source (or the file it names). As such, it’s unsuitable for applications where blocking reads can’t be made. For fully non-blocking parsing, see XMLPullParser.
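For the non-blocking case, a rough sketch of XMLPullParser could look like the following. You push data in with feed() whenever it becomes available and pull out completed events with read_events(). The helper name feed_in_chunks and the hard-coded chunks are assumptions for illustration; in a real application the chunks would come from a socket or an async reader.

```python
from xml.etree import ElementTree as ET


def feed_in_chunks(chunks):
    """Feed byte chunks to XMLPullParser and collect tags of finished elements."""
    parser = ET.XMLPullParser(events=("end",))
    tags = []
    for chunk in chunks:
        parser.feed(chunk)  # never blocks: it only consumes what it is given
        for _event, elem in parser.read_events():
            tags.append(elem.tag)
    parser.close()  # signal end of input
    for _event, elem in parser.read_events():  # drain anything left over
        tags.append(elem.tag)
    return tags


# Hypothetical chunks that split the document mid-tag.
tags = feed_in_chunks([b"<root><item>a</it", b"em><item>b</item>", b"</root>"])
print(tags)
```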

CodePudding user response：

There's only one mention of the word "memory" in the entire etree documentation, and it's in the context of iterparse() not reading the whole XML into memory:

If you don’t mind your application blocking on reading XML data but would still like to have incremental parsing capabilities, take a look at iterparse(). It can be useful when you’re reading a large XML document and don’t want to hold it wholly in memory.

However, the docs don't show any example of how to use iterparse. PyMOTW 3 has a really good write-up on the entire etree module and includes an example of iterparse.

Without any more context on what do_something() does, the simplest implementation for you would look like:

from xml.etree import ElementTree as ET

for _, elem in ET.iterparse('input.xml'):
    do_something(elem)
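One caveat: iterparse() still attaches every parsed element to the tree it is building, so memory keeps growing unless processed elements are discarded. A common pattern is to grab the root from the first "start" event and clear it after each item. The sketch below assumes the items of interest are direct children of the root, as in the question; process_items, the tag name "item", and the sample data are hypothetical.

```python
import io
from xml.etree import ElementTree as ET


def process_items(source, tag, handle):
    """Stream source, call handle(elem) for each finished tag element, then free it."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            handle(elem)
            count += 1
            root.clear()  # drop already-processed children so the tree stays small
    return count


# Hypothetical sample data; pass a filename instead for a real file.
sample = io.BytesIO(b"<root><item>a</item><item>b</item><item>c</item></root>")
seen = []
n = process_items(sample, "item", lambda e: seen.append(e.text))
print(n, seen)
```

Note that root.clear() also resets the root's own attributes, so read anything you need from the root before the loop.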