How to escape special characters in Python?

I pulled a report from a CRM system that came with some special character sequences like these:

Belgium&#x20;&#x28;Dutch&#x29;                                
Saint&#x20;Lucia                                              
Trinidad&#x20;and&#x20;Tobago                                 
Sierra&#x20;Leone                                             
Mali                                                          
Svalbard&#x20;and&#x20;Jan&#x20;Mayen

These values come from a drop-down menu in the web interface that lists all the countries and regions. From what I have read, this is an XML formatting issue. I am processing the data in Python with pandas. I got an idea from this post, but I'd like to write a regex to escape any string containing a similar sequence of characters.

By the way, I imported the csv file like this:

df = pd.read_csv('report.csv', encoding='utf-8')

And I used this to try to escape the characters (it worked for that one case only):

df['Country/Region'] = df['Country/Region'].replace(to_replace='&#x28;', value= ' ', regex=False)

This targets one specific character, but I could not figure out how to do it with a regex.

CodePudding user response:

You can use html.unescape from the standard library:

import html
df['Country/Region'] = df['Country/Region'].astype(str).map(html.unescape)

Output:

>>> df
           Country/Region
0         Belgium (Dutch)
1             Saint Lucia
2     Trinidad and Tobago
3            Sierra Leone
4                    Mali
5  Svalbard and Jan Mayen
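
Since the question asked specifically for a regex, here is a minimal sketch of one, for comparison only. It assumes the report contains only hexadecimal references of the form &#x..; as in the sample rows. The names HEX_REF and decode_hex_refs are made up for this illustration. Unlike html.unescape, it does not handle decimal references (such as &#32;) or named entities (such as &amp;), and chr() raises ValueError for code points above 0x10FFFF.

```python
import html
import re

# Matches hexadecimal numeric character references such as "&#x20;" or "&#X28;".
HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def decode_hex_refs(text: str) -> str:
    """Replace each hex reference with the character it encodes (hypothetical helper)."""
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


samples = [
    "Belgium&#x20;&#x28;Dutch&#x29;",
    "Saint&#x20;Lucia",
    "Mali",
]

for raw in samples:
    decoded = decode_hex_refs(raw)
    # For these inputs the regex and html.unescape give the same result.
    assert decoded == html.unescape(raw)
    print(decoded)
```

With pandas, the helper could be applied as df['Country/Region'].map(decode_hex_refs, na_action='ignore'). Passing na_action='ignore' leaves missing values as NaN, whereas the astype(str) call in the answer above turns them into the string 'nan'. In most cases html.unescape is still the safer choice because it covers every kind of character reference.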