How can I get the actual text from a beautiful soup class tag?

Time:01-11

  • Python Version: 3.8
  • bs4 library

I have the following HTML which represents 2 of about 20 reviews I have scraped. I didn't include the rest here because of space, but you can imagine that these blocks keep repeating.

I need to retrieve "sml-rank-stars sml-str40 star" (as seen in the second line here) from each review.

<div >
<span ></span>
<span >
<span >
                                                        口味:3.5
                                                    </span>
<span >
                                                        环境:4.0
                                                    </span>
<span >
                                                        服务:3.5
                                                    </span>
<span >人均:200元</span>
</span>
</div>
<div >
<span ></span>
<span >
<span >
                                                        口味:3.0
                                                    </span>
<span >
                                                        环境:4.5
                                                    </span>
<span >
                                                        服务:3.0
                                                    </span>
</span>
</div>

Here is what I have tried so far:

for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')

    star_rank = []
    for review in review_rank.find_all('span')[:1]:
        star_rank.append(review.get('class'))

print(star_rank)

I get the resulting output:

[['sml-rank-stars', 'sml-str5', 'star']]

I can then use this code to get the number only:

star_rank[0][1][7:]

Output:

'5'

The problem is that I only get one of the reviews. I need this line for every review, stored in my list.

My desired output is something like this, or anything else I can iterate over to get the number of stars for each review:

[['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str5', 'star']]

I have figured out how to print out a result like this with the following code, but I need it saved into a list or something else I can iterate over.

for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')

    for review in review_rank.find_all('span')[:1]:
        print(review.get('class'))

Output:

['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str5', 'star']

Answer:

To iterate over every .review-rank element, select all of them. To keep only the rank, use a list comprehension:

star_rank = []
for r in soup.select('.review-rank'):
    star_rank.append([s.replace('sml-str','') for s in r.span['class'] if 'sml-str' in s][0])

Or, following your example (I do not know the general structure above review_items, or whether each review contains one .review-rank or several):

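star_rank = []
for review in review_items.find_all('div', class_='main-review'):
    # A review may hold one or several .review-rank blocks, so loop over all of them.
    for rank in review.find_all('div', class_='review-rank'):
        # Keep only the digits of the 'sml-strNN' class on the first <span>.
        star_rank.append([s.replace('sml-str','') for s in rank.span['class'] if 'sml-str' in s][0])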

Output

['40', '35']
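
The extraction step can also be pulled into a small helper, which makes it easy to check on its own. This is only a sketch: stars_from_classes and STAR_CLASS are hypothetical names, not part of bs4, and it assumes the star class always has the form sml-str followed by digits. The class lists are copied from the output printed in the question.

```python
import re

# Hypothetical helper, not part of bs4.
# Assumption: the star class always looks like 'sml-str' followed by digits.
STAR_CLASS = re.compile(r'^sml-str(\d+)$')

def stars_from_classes(classes):
    """Return the digits of the 'sml-strNN' class (e.g. '40'), or None if absent."""
    for name in classes or []:
        match = STAR_CLASS.match(name)
        if match:
            return match.group(1)
    return None

# Class lists copied from the output shown in the question.
class_lists = [
    ['sml-rank-stars', 'sml-str40', 'star'],
    ['sml-rank-stars', 'sml-str35', 'star'],
    ['sml-rank-stars', 'sml-str5', 'star'],
]

# Build the list once, outside any loop, so every review is kept.
star_rank = [stars_from_classes(classes) for classes in class_lists]
print(star_rank)  # ['40', '35', '5']
```

With bs4, the same helper could be called as stars_from_classes(r.span.get('class')) for each r in soup.select('.review-rank'). Unlike taking [0] from the list comprehension, it returns None instead of raising IndexError when a span has no sml-str class.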