How to scrape data from text that has nested tags?

I am scraping a dictionary website and want to get the English translation of a word. I am using soup.find_all() to find the second instance of a tag in the page source. But the function is returning a long object because the tags are nested:

from bs4 import BeautifulSoup
# r is the HTTP response for the dictionary page (fetched earlier, not shown here)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('td', attrs={'class':'ToWrd'})[1]

It returns:

<td class="ToWrd">pupil <em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>

But I am only interested in "pupil", which is the translation of the word I am looking up on that dictionary website. Can anyone help me extract just this word?

Please note that I don't want to use a numpy or pandas function, because the code does not have these dependencies and I don't want to add them. For example, I am not looking for this solution:

pd.DataFrame(soup.find_all('td', attrs={'class':'ToWrd'})[1])[0][0]

which returns:

'pupil '

CodePudding user response：

There are different approaches to reach your goal. The simplest is the one mentioned by @Tim Roberts, but be aware that it only works if the translation is a single word:

soup.find_all('td', attrs={'class':'ToWrd'})[1].text.split()[0]

An alternative that works for single words, compound nouns and multi-word translations is stripped_strings:

list(soup.find_all('td', attrs={'class':'ToWrd'})[1].stripped_strings)[0]
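To make the difference between the two approaches concrete, here is a minimal sketch on made-up markup. The two-cell table and the class on the cells are assumptions modelled on the question, and the built-in html.parser is used so that no extra dependency is needed:

```python
from bs4 import BeautifulSoup

# Made-up markup modelled on the question; not taken from the real site.
html = '''
<table><tr>
<td class="ToWrd">pupil <em>n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream <em>n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td', attrs={'class': 'ToWrd'})

# Splitting on whitespace keeps only the first word of each cell.
first_words = [cell.text.split()[0] for cell in cells]

# stripped_strings yields each text node with surrounding whitespace trimmed;
# the first one is the whole translation, even if it contains spaces.
translations = [next(cell.stripped_strings) for cell in cells]

print(first_words)   # ['pupil', 'ice']
print(translations)  # ['pupil', 'ice cream']
```

Using next() instead of list(...)[0] stops after the first text node rather than collecting all of them first; both return the same value here.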

The same job can also be done by combining get_text() (with a separator and strip=True) and split(), but I prefer stripped_strings:

soup.find_all('td', attrs={'class':'ToWrd'})[1].get_text('|',strip=True).split('|')[0]

Note: If there is only one <td> with that class, use find() instead of find_all().
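As a rough illustration of that note, the sketch below uses find() on a snippet with a single matching cell. The sample markup and the helper name first_translation are hypothetical. The None check is there because find() returns None when nothing matches, whereas find_all() returns an empty list:

```python
from bs4 import BeautifulSoup


def first_translation(html):
    """Return the leading text of the first td.ToWrd cell, or None.

    Hypothetical helper, for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td', attrs={'class': 'ToWrd'})
    if cell is None:
        # find() returns None when no tag matches.
        return None
    # The default of None covers a cell that contains no text at all.
    return next(cell.stripped_strings, None)


sample = '<table><tr><td class="ToWrd">pupil <em>n</em></td></tr></table>'
print(first_translation(sample))          # pupil
print(first_translation('<p>no td</p>'))  # None
```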

Example

This extracts single words as well as compound nouns / multiple words:

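from bs4 import BeautifulSoup

# The cells need class="ToWrd", otherwise find_all() below matches nothing
html = '''
<td class="ToWrd">pupil<em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''

soup = BeautifulSoup(html, 'lxml')

# First stripped text node of every matching cell
print([list(w.stripped_strings)[0] for w in soup.find_all('td', attrs={'class':'ToWrd'})])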

Output

['pupil', 'ice cream']

CodePudding user response：

How about using a regex:

import re

# \w+ captures one or more word characters between the two tags
valid = re.compile(r'<td >(\w+) <em >')
print(valid.match('<td >pupil <em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>').group(1))

returns

pupil

The example above only works if the wanted word is always preceded by <td > and followed by <em >. The pattern also captures only a single word, so a translation such as "ice cream" would not match. You can adjust the regex accordingly.
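One possible adjustment, sketched under the assumption that the translation is always the text between the opening td tag and the first em tag: the pattern below tolerates attributes on both tags and allows spaces in the captured text. The pattern and the sample strings are illustrative only, and a regex stays more fragile than an HTML parser if the markup changes.

```python
import re

# Illustrative pattern: optional attributes on <td>, then text up to the first <em>.
pattern = re.compile(r'<td[^>]*>\s*([^<]+?)\s*<em[^>]*>')

samples = [
    '<td >pupil <em >n<span><i>noun</i></span></em></td>',
    '<td class="ToWrd">ice cream <em>n<span><i>noun</i></span></em></td>',
]

# search() scans the whole string, so the tag need not start at position 0.
words = [pattern.search(s).group(1) for s in samples]
print(words)  # ['pupil', 'ice cream']
```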