Weird character printed when web scraping

I tried writing some code to find and print the price of a specific book, but when I ran it, it printed "Â£54.23".

What is "Â"? How do I make it go away?

From my understanding I'm supposed to copy the CSS path for soup.select, but since that option did not show up in Chrome I copied the selector instead. Could this be responsible for the "Â"?

Here's my Python code:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
headers = {'User-Agent': user_agent}
res_obj = requests.get('http://books.toscrape.com/')
res_obj.raise_for_status()
soup = BeautifulSoup(res_obj.text, 'html.parser')
sapiens_price = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(5) > article > div.product_price > p.price_color')
print(sapiens_price[0].text)

CodePudding user response：

Try this. Note that it strips every non-ASCII character, so the £ sign is removed along with the Â:

soup = BeautifulSoup(res_obj.text, 'html.parser')

sapiens_price = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(5) > article > div.product_price > p.price_color')

print(sapiens_price[0].text.encode('ascii', 'ignore').decode())
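To make the effect of encode('ascii', 'ignore') concrete, here is a minimal sketch that needs no network access. The string scraped stands in for the text from the question, and the title string is a made-up example, not data from the site.

```python
# The mis-decoded text from the question, used here as a plain string.
scraped = 'Â£54.23'

# errors='ignore' silently drops every character that is not ASCII,
# so both 'Â' and '£' disappear.
stripped = scraped.encode('ascii', 'ignore').decode()
print(stripped)

# Hypothetical title, for illustration only: accented letters are lost too.
title = 'Café Â£5.00'
stripped_title = title.encode('ascii', 'ignore').decode()
print(stripped_title)
```

This hides the symptom rather than fixing the decoding, which matters if the currency symbol or any accented text on the page is needed later.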

CodePudding user response：

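The reason is that res_obj.text is being decoded with the wrong encoding. The page is UTF-8, where £ is stored as the two bytes C2 A3; decoded as ISO-8859-1, those two bytes become the two characters Â and £.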
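The mismatch can be reproduced with plain strings and bytes. This is only a sketch of the mechanism: the variable names are illustrative, and it assumes the server sends the price as UTF-8 bytes.

```python
price = '£54.23'

# What a UTF-8 page actually sends over the wire.
raw = price.encode('utf-8')
print(raw)  # b'\xc2\xa354.23'

# Decoding those bytes as ISO-8859-1 turns each byte into one character.
wrong = raw.decode('iso-8859-1')
print(wrong)  # Â£54.23

# Decoding with the matching codec gives the original text back.
right = raw.decode('utf-8')
print(right)  # £54.23

# An already-garbled string can be repaired by reversing the wrong step,
# provided nothing was lost in between.
repaired = wrong.encode('iso-8859-1').decode('utf-8')
```

The CSS selector plays no part in this; the extra character appears as soon as the bytes are decoded with the wrong codec.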

See the Requests documentation, in particular this passage:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text

In your case, if you run your code in an interactive session such as IDLE, this is what you get when you check the encoding:

>>> res_obj.encoding
'ISO-8859-1'
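As a rough illustration of how a header-based guess can end up as ISO-8859-1, the sketch below mimics the idea with the standard library only. It is a simplified stand-in, not Requests' actual code: guess_encoding is a hypothetical helper, and it assumes that a text/* Content-Type with no charset parameter falls back to ISO-8859-1. Whether this particular site omits the charset is also an assumption.

```python
from email.message import Message


def guess_encoding(content_type):
    """Simplified, hypothetical model of a header-based encoding guess."""
    msg = Message()
    msg['Content-Type'] = content_type

    # An explicit charset in the header wins.
    charset = msg.get_param('charset')
    if charset:
        return charset

    # Assumed fallback: text/* without a charset is treated as ISO-8859-1.
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'

    # Nothing to go on for other content types in this sketch.
    return None


print(guess_encoding('text/html'))
print(guess_encoding('text/html; charset=utf-8'))
```

If the header really carries no charset, the guess never looks at the page body, which is why a UTF-8 page can still come back labelled ISO-8859-1.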

Again from the documentation:

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text

To override the guessed encoding, simply set a new one. In your case it should be UTF-8:

>>> res_obj.encoding='UTF-8'

Do this before accessing res_obj.text and your code will work correctly:

res_obj = requests.get('http://books.toscrape.com/')
# Set the encoding manually, before res_obj.text is read
res_obj.encoding='utf-8'
soup = BeautifulSoup(res_obj.text, 'html.parser')
sapiens_price = soup.select('...')
print(sapiens_price[0].text)

TL;DR: set res_obj.encoding = 'utf-8' before accessing res_obj.text.