I have a collection of XML files that I would like to read into either a DataFrame (df) or a dictionary (dict). Each XML file has the same format with regard to the classes.
import os
import glob
import re
import numpy as np
import pandas as pd
import sys
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import xml.etree.ElementTree as ET
The following code then works:
path = '../mycollection'
files = os.listdir(path)
print(len(files))
I am unsure of how to proceed. Any help would be brilliant.
CodePudding user response:
You can use a library such as xmltodict, or write your own parser. From the xmltodict README:
>>> import xmltodict
>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True
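If you would rather write your own parser, here is a minimal sketch that uses only the standard library. It assumes each file is fairly flat, meaning every piece of data is the text of an element and tag names do not repeat within a file; your real files may need more handling (attributes, nesting, namespaces). The names `xml_file_to_record` and `load_xml_folder` are made up for this example.

```python
import glob
import os
import xml.etree.ElementTree as ET


def xml_file_to_record(xml_path):
    """Flatten one XML file into a dict mapping tag name -> element text."""
    root = ET.parse(xml_path).getroot()
    record = {"file": os.path.basename(xml_path)}
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:  # skip elements that only contain other elements
            # Assumption: tag names are unique within a file; a repeated
            # tag would overwrite the earlier value.
            record[elem.tag] = text
    return record


def load_xml_folder(folder):
    """Return {filename: record} for every *.xml file in the folder."""
    paths = sorted(glob.glob(os.path.join(folder, "*.xml")))
    return {os.path.basename(p): xml_file_to_record(p) for p in paths}
```

Calling `load_xml_folder('../mycollection')` gives you the dictionary form. Since every record has the same keys, `pd.DataFrame(list(records.values()))` on that result should give one row per file.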
