Encoding issues with Python and BeautifulSoup

Time:01-19

I am scraping some product pages from Amazon and would like to store the product titles, but I have a problem with the encoding.

def get_information_products(href):
    url = 'https://www.amazon.fr' + href
    url = Request(url)
    ua = UserAgent()
    url.add_header('User-Agent', ua.random)
    
    with urlopen(url) as f:
        data = f.readlines()  
    
    page_soup = soup(str(data), 'html.parser', from_encoding='iso-8859-1')
    title_list = []
    
    try:
        title = page_soup.find("span", attrs={"id": 'productTitle'})
        print(title.get_text(strip=True))
        return title.get_text(strip=True)
    except:
        return ''

This is the piece of code that gets the data. Afterwards I save the data to a CSV file, but I always have the same issue. My product titles look like this:

OVO Sthira - Lot de 2 Briques de Yoga en Li\xc3\xa8ge Premium - Ultra Fin - Bloc Yoga - Brique Yoga - Block Yoga - Accessoire de Yoga \xc3\xa9cologique

I don't know how to save the data with the right characters...
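
For reference, here is a minimal offline sketch of how output like the title above can arise. It assumes the page bytes are UTF-8 and uses a made-up one-line list as a stand-in for what `f.readlines()` returns; the sample HTML and the variable names are illustrative, not taken from the real page.

```python
# Hypothetical stand-in for f.readlines(): a list of raw UTF-8 byte lines.
data = ['<span id="productTitle">Briques de Yoga en Liège</span>\n'.encode("utf-8")]

# str() on a list of bytes returns the list's repr, so every non-ASCII byte
# shows up as a literal backslash escape such as \xc3\xa8.
as_repr = str(data)

# Joining the byte lines and decoding them yields the real text instead.
as_text = b"".join(data).decode("utf-8")

print(as_repr)
print(as_text)
```

If that is what happens here, passing the raw bytes (for example from `f.read()`) to the parser, or decoding them first, may avoid the escapes altogether.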

CodePudding user response:

It seems your page title is encoded in UTF-8. Can you try this:

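# Avoid naming the variable "str", which would shadow the built-in type.
text = title.get_text(strip=True)
# Keep the result: encode() and decode() return new objects.
text = text.encode("windows-1252").decode("utf-8")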

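If the value is a byte string rather than a Unicode string, you may need an extra step (in Python 3, only bytes objects have a decode method):

# raw_title is assumed to be a bytes object here.
text = raw_title.decode("utf-8").encode("windows-1252").decode("utf-8")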
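
A small sketch of what this round trip does, under the assumption that the text was UTF-8 that got decoded as windows-1252 somewhere along the way. The sample string is invented for illustration.

```python
original = "Liège écologique"

# Simulate the damage: UTF-8 bytes wrongly decoded as windows-1252.
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)  # LiÃ¨ge Ã©cologique

# Reverse it: recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)  # Liège écologique
```

Note that this only helps when the string contains characters such as `Ã¨`. A string holding literal backslash sequences like `\xc3\xa8` is a different situation, and this round trip would leave it unchanged.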

CodePudding user response:

You might try the unicodedata module:

import unicodedata

unicodedata.normalize("NFKD", your_text)
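
To make clear what normalization does and does not do, here is a short sketch with an invented sample word. NFKD works on text that is already decoded: it splits an accented character into a base letter plus a combining mark, and NFC puts them back together.

```python
import unicodedata

composed = "Li\u00e8ge"  # "Liège" with a single precomposed è (U+00E8)

# NFKD decomposes è into "e" followed by a combining grave accent (U+0300).
decomposed = unicodedata.normalize("NFKD", composed)
print(len(composed), len(decomposed))  # 5 6

# NFC recomposes the pair into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)

# A string of literal backslash escapes is plain ASCII, so it is unchanged.
escaped = r"Li\xc3\xa8ge"
unchanged = unicodedata.normalize("NFKD", escaped)
```

So normalization may be useful for comparing or cleaning accented text, but on its own it probably will not turn `\xc3\xa8` back into `è`.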
