- Python Version: 3.8
- bs4 library
I have the following HTML which represents 2 of about 20 reviews I have scraped. I didn't include the rest here because of space, but you can imagine that these blocks keep repeating.
I need to retrieve the class list "sml-rank-stars sml-str40 star" from each review. It is the class attribute of the first span in each block, i.e. the second line of each block below.
    <div >
      <span ></span>
      <span >
        <span >
          口味:3.5
        </span>
        <span >
          环境:4.0
        </span>
        <span >
          服务:3.5
        </span>
        <span >人均:200元</span>
      </span>
    </div>
    <div >
      <span ></span>
      <span >
        <span >
          口味:3.0
        </span>
        <span >
          环境:4.5
        </span>
        <span >
          服务:3.0
        </span>
      </span>
    </div>
Here is what I have tried so far:
    for review in review_items.find_all('div', class_='main-review'):
        review_rank = review.find('div', class_='review-rank')
        star_rank = []
        for review in review_rank.find_all('span')[:1]:
            star_rank.append(review.get('class'))
    print(star_rank)
I get the resulting output:
[['sml-rank-stars', 'sml-str5', 'star']]
I can then use this code to get the number only:
star_rank[0][1][7:]
Output:
'5'
The problem is that I only get one of the reviews, but I need this line for every review stored in my list.
My desired output is something like this, or anything else I can iterate over to get the number of stars for each review:
[['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str5', 'star']]
I have figured out how to print out a result like this with the following code, but I need it saved into a list or something else I can iterate over.
    for review in review_items.find_all('div', class_='main-review'):
        review_rank = review.find('div', class_='review-rank')
        for review in review_rank.find_all('span')[:1]:
            print(review.get('class'))
Output:
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str5', 'star']
Answer:
To iterate over every .review-rank element, select all of them with soup.select(). To get only the rank, use a list comprehension that keeps the class token containing 'sml-str' and strips that prefix:
    star_rank = []
    for r in soup.select('.review-rank'):
        star_rank.append([s.replace('sml-str','') for s in r.span['class'] if 'sml-str' in s][0])
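To check this approach end to end, here is a small self-contained sketch. The markup is a reconstruction, not the real page: the class names 'main-review', 'review-rank' and 'sml-rank-stars sml-strNN star' are taken from the selectors and outputs in the question, and the exact nesting is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class attributes are reconstructed from the
# selectors used in the question, not copied from the real page.
html = """
<div class="main-review">
  <div class="review-rank">
    <span class="sml-rank-stars sml-str40 star"></span>
    <span><span>口味:3.5</span></span>
  </div>
</div>
<div class="main-review">
  <div class="review-rank">
    <span class="sml-rank-stars sml-str35 star"></span>
    <span><span>口味:3.0</span></span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

star_rank = []
for rank in soup.select(".review-rank"):
    # rank.span is the first <span> inside the block; "class" is a list of tokens
    classes = rank.span.get("class", [])
    # keep only the token that carries the rating and strip its prefix
    star_rank.extend(
        c.replace("sml-str", "") for c in classes if c.startswith("sml-str")
    )

print(star_rank)  # ['40', '35']
```

The important difference from the attempt in the question is that star_rank is created once, before the loop, so each review adds to the same list instead of replacing it.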
Alternatively, following the structure of your example (I do not know the general structure above review_items, or whether each review contains one or many .review-rank blocks):
    star_rank = []
    for review in review_items.find_all('div', class_='main-review'):
        # use a separate name for the inner loop so it does not shadow `review`
        for rank in review.find_all('div', class_='review-rank'):
            star_rank.append([s.replace('sml-str','') for s in rank.span['class'] if 'sml-str' in s][0])
Output:
['40', '35']
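If you would rather have numbers than strings, or some reviews might have no rating token at all, a small helper along these lines may be more robust than slicing with [7:] or indexing with [0]. This is only a sketch: star_value is a hypothetical name, and it assumes the rating always appears as a class token of the form 'sml-str' followed by digits.

```python
import re

def star_value(classes):
    """Return the numeric suffix of the first 'sml-str<N>' token, or None."""
    for token in classes:
        match = re.fullmatch(r"sml-str(\d+)", token)
        if match:
            return int(match.group(1))
    return None  # no rating token found in this class list

# Class lists in the shape returned by tag.get('class')
ranks = [
    ['sml-rank-stars', 'sml-str40', 'star'],
    ['sml-rank-stars', 'sml-str5', 'star'],
    ['sml-rank-stars', 'star'],  # hypothetical review without a rating token
]

values = [star_value(c) for c in ranks]
print(values)  # [40, 5, None]
```

Whether 40 should be read as 4.0 stars (dividing by 10) depends on how the site encodes ratings, which the page itself would have to confirm.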
