I am scraping a dictionary website and want to get the English translation of a word. I am using soup.find_all() to find the second instance of a tag in the page source, but because the tags are nested, the result contains much more than the word itself:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('td', attrs={'class':'ToWrd'})[1]
It returns:
<td class="ToWrd">pupil <em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
But I am only interested in "pupil", which is the translation of the word I looked up on that dictionary website. How can I extract just this word?
Please note that I don't want to use a numpy or pandas function, because the code does not have these dependencies and I don't want to add them. For example, I am not looking for this solution:
pd.DataFrame(soup.find_all('td', attrs={'class':'ToWrd'})[1])[0][0]
which returns:
'pupil '
CodePudding user response:
There are different approaches to reach your goal. The simplest is the one mentioned by @Tim Roberts, but be aware that it only works if the translation is a single word:
soup.find_all('td', attrs={'class':'ToWrd'})[1].text.split()[0]
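To make the limitation concrete, here is a small sketch run against two hand-written cells. The HTML is made up for illustration and assumes the cells carry the class ToWrd, as in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the cell shown in the question.
html = '''
<table><tr>
<td class="ToWrd">pupil <em>n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream <em>n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
</tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td', attrs={'class': 'ToWrd'})

# Splitting on whitespace keeps only the first token of each cell's text.
first_words = [cell.text.split()[0] for cell in cells]
print(first_words)  # ['pupil', 'ice']
```

The second entry comes back as 'ice' rather than 'ice cream', which is why this approach is only safe for single-word translations.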
An alternative that works with single words as well as compound nouns or multiple words is stripped_strings:
list(soup.find_all('td', attrs={'class':'ToWrd'})[1].stripped_strings)[0]
The same job can also be done by combining get_text() (with a separator and strip=True) with split(), but I prefer stripped_strings:
soup.find_all('td', attrs={'class':'ToWrd'})[1].get_text('|',strip=True).split('|')[0]
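As a quick side-by-side check, the sketch below runs both variants against one hand-written cell (hypothetical markup, class ToWrd assumed as in the question). It uses next() on the stripped_strings generator instead of building a full list, which is just one possible way to take the first string:

```python
from bs4 import BeautifulSoup

# Hypothetical sample cell containing a two-word translation.
html = ('<table><tr><td class="ToWrd">ice cream <em>n<span><i>noun</i>: '
        'Refers to person, place, thing, quality, etc.</span></em></td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td', attrs={'class': 'ToWrd'})

# First non-empty, whitespace-stripped string inside the cell.
via_strings = next(cell.stripped_strings)

# Join all strings with '|' and take the first piece.
via_get_text = cell.get_text('|', strip=True).split('|')[0]

print(via_strings)   # ice cream
print(via_get_text)  # ice cream
```

One reason to favour stripped_strings: the get_text() variant would cut the word short if the text itself ever contained the chosen separator character.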
Note: If there is only one <td> with that class, use find() instead of find_all().
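A minimal sketch of the find() variant, again on made-up markup. Since find() returns None when nothing matches, the example guards against that case; the class name NoSuchClass is hypothetical and only there to show the miss:

```python
from bs4 import BeautifulSoup

# Hypothetical page with exactly one matching cell.
html = '<table><tr><td class="ToWrd">pupil <em>n</em></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if there is no match.
cell = soup.find('td', attrs={'class': 'ToWrd'})
word = next(cell.stripped_strings, None) if cell is not None else None
print(word)  # pupil

# A class that does not occur in the markup yields None instead of a tag.
missing = soup.find('td', attrs={'class': 'NoSuchClass'})
print(missing)  # None
```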
Example
This extracts single words as well as compound nouns or multiple words:
from bs4 import BeautifulSoup

html = '''
<td class="ToWrd">pupil<em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''
soup = BeautifulSoup(html, 'lxml')
[list(w.stripped_strings)[0] for w in soup.find_all('td', attrs={'class':'ToWrd'})]
Output
['pupil', 'ice cream']
CodePudding user response:
How about using a regex:
import re
valid = re.compile(r'<td >(\w+) <em >')
print(valid.match('<td >pupil <em >n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>').group(1))
returns
pupil
The example above works only if <td > comes directly before and <em > directly after the word you want, though. But you can adjust the regex accordingly.
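One possible adjustment, sketched here under the assumption that the wanted text sits directly inside the <td> and is followed by an <em> tag. The pattern and the sample strings are illustrative, not taken from the actual site. It tolerates attributes on <td> and allows more than one word:

```python
import re

# Assumed pattern: optional attributes on <td>, then the text up to the next <em ...> tag.
pattern = re.compile(r'<td[^>]*>\s*([^<]+?)\s*<em\b')

# Hypothetical sample cells, with and without attributes.
samples = [
    '<td >pupil <em >n<span><i>noun</i></span></em></td>',
    '<td class="ToWrd">ice cream <em>n<span><i>noun</i></span></em></td>',
]

words = []
for s in samples:
    m = pattern.search(s)
    # search() returns None when nothing matches, so guard before calling group().
    words.append(m.group(1) if m else None)

print(words)  # ['pupil', 'ice cream']
```

Keep in mind that a regex like this depends on the exact shape of the markup, so parsing with BeautifulSoup is usually the more robust choice.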
