XML tag is getting converted to an inline tag when parsed through BeautifulSoup

Time:01-28

I have an XML file that I am trying to parse with BeautifulSoup. I can retrieve everything I want except the text between the "col" tags.

<?xml version="1.0" encoding="UTF-8"?>
<survey>
<radio 
  label="Q2"
  averages="cols"
  type="rating">
  <title>How much do you trust your employer to do what is right?</title>
  <comment>Please use a 9-point scale where 1 means you ‘don’t trust them at all’ and 9 means you ‘trust them a great deal’ to do what is right.</comment>
  <col label="c1" value="1">1<br />Don’t trust them at all</col>
  <col label="c2" value="2">2</col>
  <col label="c3" value="3">3</col>
  <col label="c4" value="4">4</col>
  <col label="c5" value="5">5</col>
  <col label="c6" value="6">6</col>
  <col label="c7" value="7">7</col>
  <col label="c8" value="8">8</col>
  <col label="c9" value="9">9<br />Trust them a great deal</col>
  <col label="c99" value="99">Don’t know</col>
</radio>

<suspend/>
</survey>

When I parse it with BeautifulSoup using soup = BeautifulSoup(open(file=file_path, mode='r', encoding='utf8', errors='ignore'), 'html.parser') and print the soup contents, I get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<survey>
<radio averages="cols" label="Q2" type="rating">
<title>How much do you trust your employer to do what is right?</title>
<comment>Please use a 9-point scale where 1 means you dont trust them at all and 9 means you trust them a great deal to do what is right.</comment>
<col label="c1" value="1"/>1<br/>Dont trust them at all
  <col label="c2" value="2"/>2
  <col label="c3" value="3"/>3
  <col label="c4" value="4"/>4
  <col label="c5" value="5"/>5
  <col label="c6" value="6"/>6
  <col label="c7" value="7"/>7
  <col label="c8" value="8"/>8
  <col label="c9" value="9"/>9<br/>Trust them a great deal
  <col label="c99" value="99"/>Dont know
</radio>
<suspend></suspend>
</survey>

Any help is really appreciated!!

CodePudding user response:

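You're parsing an XML file with an HTML parser. With 'html.parser', BeautifulSoup applies HTML rules, and in HTML <col> is a void element (like <br>), so it is closed immediately and the text lands after the tag instead of inside it. Use an XML parser instead.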
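The effect can be reproduced on a single element. This is a minimal sketch, assuming bs4 is installed; the one-line snippet is a made-up reduction of the survey file, not the original input.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element reduction of the survey XML from the question.
snippet = '<radio><col label="c2" value="2">2</col></radio>'

# html.parser follows HTML rules, where <col> cannot have content.
soup = BeautifulSoup(snippet, "html.parser")
col = soup.find("col")

col_text = col.get_text()        # text inside the <col> tag
after_col = col.next_sibling     # node that follows the <col> tag

print(repr(col_text))
print(repr(str(after_col)))
```

If the explanation holds, the tag itself is empty and the "2" shows up as its next sibling, which matches the `<col .../>2` lines in the printed soup.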

Try this:

from bs4 import BeautifulSoup

xml_sample = """<?xml version="1.0" encoding="UTF-8"?>
<survey>
<radio 
  label="Q2"
  averages="cols"
  type="rating">
  <title>How much do you trust your employer to do what is right?</title>
  <comment>Please use a 9-point scale where 1 means you ‘don’t trust them at all’ and 9 means you ‘trust them a great deal’ to do what is right.</comment>
  <col label="c1" value="1">1<br />Don’t trust them at all</col>
  <col label="c2" value="2">2</col>
  <col label="c3" value="3">3</col>
  <col label="c4" value="4">4</col>
  <col label="c5" value="5">5</col>
  <col label="c6" value="6">6</col>
  <col label="c7" value="7">7</col>
  <col label="c8" value="8">8</col>
  <col label="c9" value="9">9<br />Trust them a great deal</col>
  <col label="c99" value="99">Don’t know</col>
</radio>

<suspend/>
</survey>
"""

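# features="xml" selects an XML parser (requires the lxml package),
# so <col> is treated as an ordinary element that can hold text.
soup = BeautifulSoup(xml_sample, features="xml")
print("\n".join(col.get_text() for col in soup.find_all("col")))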

Output:

1Don’t trust them at all
2
3
4
5
6
7
8
9Trust them a great deal
Don’t know
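
Note that features="xml" depends on lxml being installed. If that is not an option, the same text can probably be pulled out with the standard library. The following is only a sketch under that assumption: it uses xml.etree.ElementTree on a shortened copy of the sample (the shortened string and the col_texts name are illustrative, not part of the original code).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the survey XML from the question (illustrative only).
xml_sample = """<survey>
<radio label="Q2" averages="cols" type="rating">
  <col label="c1" value="1">1<br />Don’t trust them at all</col>
  <col label="c2" value="2">2</col>
  <col label="c99" value="99">Don’t know</col>
</radio>
<suspend/>
</survey>"""

root = ET.fromstring(xml_sample)

# itertext() walks the element and its children, so the text that
# follows the nested <br /> is included as well.
col_texts = {
    col.get("label"): "".join(col.itertext())
    for col in root.iter("col")
}

for label, text in col_texts.items():
    print(label, text)
```

Keying the result by the label attribute keeps each answer option tied to its text, which may be more convenient than a flat list when the values are used later.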