I am scraping some pages from Amazon and would like to store the titles of some products, but I have a problem with the encoding.
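# Imports assumed from the names used below (UserAgent is presumably from fake_useragent)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
from fake_useragent import UserAgent

def get_information_products(href):
    url = 'https://www.amazon.fr' + href
    url = Request(url)
    ua = UserAgent()
    url.add_header('User-Agent', ua.random)
    with urlopen(url) as f:
        data = f.readlines()
    page_soup = soup(str(data), 'html.parser', from_encoding='iso-8859-1')
    title_list = []
    try:
        title = page_soup.find("span", attrs={"id": 'productTitle'})
        print(title.get_text(strip=True))
        return title.get_text(strip=True)
    except AttributeError:
        # find() returned None: no productTitle element on the page
        return ''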
This is the piece of code that gets the data. After that I save the data to CSV, but I always have the same issue. My product titles look like this:
OVO Sthira - Lot de 2 Briques de Yoga en Li\xc3\xa8ge Premium - Ultra Fin - Bloc Yoga - Brique Yoga - Block Yoga - Accessoire de Yoga \xc3\xa9cologique
I don't know how to save the data with the right characters.
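For illustration, the escaped bytes in the title above can be reproduced without any network access. This is a minimal sketch, assuming the page is served as UTF-8; the sample HTML and the variable names are made up. It shows that calling `str()` on a list of bytes (as `str(data)` does after `readlines()`) produces the list's printable representation, in which non-ASCII bytes appear as literal `\x..` escapes, whereas joining the bytes and decoding them keeps the accented characters.

```python
# Hypothetical stand-in for the list of byte lines returned by f.readlines()
raw_lines = ["<span id='productTitle'>Li\u00e8ge \u00e9cologique</span>\n".encode("utf-8")]

# str() on a list of bytes yields its repr: non-ASCII bytes become literal \x escapes
as_repr = str(raw_lines)
print(as_repr)

# Joining the byte lines and decoding them as UTF-8 preserves the accents
decoded = b"".join(raw_lines).decode("utf-8")
print(decoded)
```

Under that assumption, passing the raw bytes (or the decoded text) to the parser instead of `str(data)` may be enough to avoid the escapes.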
CodePudding user response:
It seems your page title is in UTF-8. Can you try this:
text = title.get_text(strip=True)
text = text.encode("windows-1252").decode("utf-8")
If it's a byte string (bytes) rather than a str, you may need an extra step:
text = text.decode("utf-8").encode("windows-1252").decode("utf-8")
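To illustrate the re-encoding idea above, here is a minimal sketch, assuming the text is UTF-8 that was wrongly decoded as windows-1252. The helper name `fix_mojibake` is hypothetical. Note that this round trip only repairs that kind of garbling; it would leave literal backslash escapes such as `\xc3\xa8` unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as windows-1252."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not garbled in this particular way: return the text unchanged
        return text

# Build a garbled sample by decoding UTF-8 bytes with the wrong codec
garbled = "Li\u00e8ge".encode("utf-8").decode("windows-1252")
print(garbled)
print(fix_mojibake(garbled))
```

The `try`/`except` is a design choice for this sketch: text that is already correct usually fails the round trip, so it is returned as-is rather than raising.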
CodePudding user response:
You might try the unicodedata module:
import unicodedata
normalized = unicodedata.normalize("NFKD", your_text)
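As a small sketch of what this call does (the sample text is made up): NFKD normalization splits each accented letter into a base letter plus a combining mark, and NFC puts them back together. It changes the Unicode form of text that is already decoded, so it may not help on its own if the string contains escaped bytes.

```python
import unicodedata

# Sample text in composed form: each accented letter is a single code point
text = "Li\u00e8ge \u00e9cologique"

# NFKD decomposes each accented letter into base letter + combining accent
decomposed = unicodedata.normalize("NFKD", text)
print(len(text), len(decomposed))

# NFC recomposes the pairs into single code points
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == text)
```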
