I pulled a report from a CRM system, and some of the values came with special characters, for example:
Belgium (Dutch)
Saint Lucia
Trinidad and Tobago
Sierra Leone
Mali
Svalbard and Jan Mayen
These values come from a drop-down menu in the web interface that lists all the countries and regions. From what I have read, this is an XML formatting issue. I am processing the data in Python with pandas. I got an idea from this post, but I'd like to write a regex that decodes any string containing a similar sequence of characters.
By the way, I imported the csv file like this:
df = pd.read_csv('report.csv', encoding='utf-8')
And I used this to try to replace the characters (which worked for that one case only):
df['Country/Region'] = df['Country/Region'].replace(to_replace='(', value=' ', regex=False)
This handles one specific character, but I could not figure out how to do it with a regex.
CodePudding user response:
You can use the standard-library function html.unescape:
import html
df['Country/Region'] = df['Country/Region'].astype(str).map(html.unescape)
Output:
>>> df
Country/Region
0 Belgium (Dutch)
1 Saint Lucia
2 Trinidad and Tobago
3 Sierra Leone
4 Mali
5 Svalbard and Jan Mayen
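
One caveat with the snippet above: astype(str) turns missing values into the literal string 'nan'. Below is a minimal sketch that leaves missing values alone, plus a regex-only variant since the question asked for one. The sample strings and the helper names unescape_text and decode_numeric_refs are hypothetical, and the sketch assumes the report contains decimal character references such as &#40; (the exact sequences in the original file are not shown here).

```python
import html
import re

import pandas as pd

# Hypothetical sample data: the exact escape sequences in the real report are assumed.
df = pd.DataFrame(
    {"Country/Region": ["Belgium &#40;Dutch&#41;", "Trinidad &amp; Tobago", None]}
)


def unescape_text(value):
    """Decode HTML/XML character references, leaving non-strings (e.g. NaN) untouched."""
    return html.unescape(value) if isinstance(value, str) else value


# Matches decimal numeric references only, e.g. "&#40;".
NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(value):
    """Regex-only alternative: handles decimal references, not named ones like &amp;."""
    if not isinstance(value, str):
        return value
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), value)


df["Country/Region"] = df["Country/Region"].map(unescape_text)
print(df)
```

html.unescape is usually the safer choice because it also covers named and hexadecimal references; the regex version is only worth using if you need to restrict decoding to a specific pattern.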
