Python: read a collection of XML files into a df or dict

I have a collection of XML files that I would like to read into either a DataFrame (df) or a dictionary (dict). Each XML file has the same format with regard to the classes.

import os
import glob
import re
import numpy as np
import pandas as pd
import sys

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

import xml.etree.ElementTree as ET

The following code works and prints the number of files in the folder:

path = '../mycollection'


files = os.listdir(path)
print(len(files))

I am unsure how to proceed from here. Any help would be brilliant.

CodePudding user response:

You can use a library such as xmltodict, or write your own parser. From the xmltodict README:

>>> import xmltodict
>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True
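
If you would rather write your own parser, here is a minimal sketch using only the standard library (glob and xml.etree.ElementTree, both already in your imports). It rests on assumptions about your files that the question does not confirm: that every file ends in .xml, and that each value you want is the plain text of a direct child of the root element, with each child tag appearing once. The function names xml_file_to_record and xml_dir_to_records are made up for this example.

```python
import glob
import os
import xml.etree.ElementTree as ET


def xml_file_to_record(file_path):
    """Flatten one XML file into a dict mapping each child tag of the root to its text."""
    root = ET.parse(file_path).getroot()
    # Keep the file name so each record can be traced back to its source.
    record = {"file": os.path.basename(file_path)}
    for child in root:
        # Assumption: each child tag appears once and holds plain text.
        # A repeated tag would overwrite the earlier value.
        record[child.tag] = (child.text or "").strip()
    return record


def xml_dir_to_records(path):
    """Parse every *.xml file in `path`, sorted by name, into a list of dicts."""
    pattern = os.path.join(path, "*.xml")
    return [xml_file_to_record(p) for p in sorted(glob.glob(pattern))]
```

With your folder, records = xml_dir_to_records('../mycollection') should give one dict per file, and pd.DataFrame(records) turns that list into a DataFrame with one row per file. If your files are nested more deeply or use namespaces, the loop over the root's children would need adjusting to match.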