How to filter rows with non-Latin characters

I'm stuck on a problem with a DataFrame that has a column of film titles. Many of the titles are in non-Latin scripts such as Japanese or Chinese (and possibly Russian). Here is the column:

df['title'].head(5)

1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea

I want to drop every row whose title contains non-Latin characters, such as rows 3 and 4, so that the desired output is:

df['title'].head(5)

1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude

Any help with this code?

CodePudding user response：

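You can encode the title column with unicode_escape and then decode the result as latin1. Every non-ASCII character is turned into an escape sequence, so a title that no longer matches the original after this round trip contains such characters and its row can be removed. Note that this also removes titles with accented Latin letters such as é: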

df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)

# Output
          title
0   I am legend
1  wonder women
3      dead sea
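
To see why the comparison works, here is a minimal sketch of the same round trip on plain strings. The helper name survives_round_trip and the sample titles are made up for illustration and are not part of the answer above. The accented title is included to show that the check is effectively ASCII-only.

```python
def survives_round_trip(text: str) -> bool:
    """Return True if text is unchanged by unicode_escape -> latin1."""
    # unicode_escape rewrites every non-ASCII character (and also
    # backslashes, newlines and tabs) as an ASCII escape sequence,
    # for example \u30a2 or \xe9.
    escaped = text.encode('unicode_escape').decode('latin1')
    return escaped == text


# Hypothetical sample titles
samples = ['I am legend', 'アライヴ', 'Amélie']
results = {title: survives_round_trip(title) for title in samples}
print(results)
```

Because backslashes are escaped too, a title that contains a literal backslash would also be dropped by this filter.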

CodePudding user response：

You can use str.match with the Latin-1 character range (\x00-\xFF) to flag titles that contain a character outside it, then use the negated boolean mask to filter the rows:

df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]

output:

          title
1   I am legend
2  wonder women
5      dead sea
6       the rig
7      altitude
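
A variant of the same idea, sketched with str.contains so that the leading .* is not needed. The sample titles and the name has_non_latin1 are made up for illustration. It shows what the \x00-\xFF range actually covers: accented Western European letters pass, while Latin letters outside Latin-1 (such as Ł), Cyrillic and CJK are all filtered out.

```python
import pandas as pd

# Hypothetical sample data: ASCII, accented Latin-1, Latin Extended,
# Cyrillic and CJK titles.
df = pd.DataFrame(
    {'title': ['I am legend', 'Amélie', 'Łódź', 'Сталкер', '怪獣総進撃']}
)

# True where the title has at least one character above U+00FF.
has_non_latin1 = df['title'].str.contains(r'[^\x00-\xFF]')

# Keep only the rows without such a character.
df_latin = df[~has_non_latin1]
kept = df_latin['title'].tolist()
print(kept)
```

If the column may contain missing values, passing na=False to str.contains keeps the mask boolean.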

CodePudding user response：

You can use the isascii() method (if you're using Python 3.7 or later). Example:

"I am legend".isascii()  # True
"アライヴ".isascii()  # False

Even a single non-ASCII character is enough for isascii() to return False.

(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
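
isascii() also rejects accented Latin letters such as é. If those titles should be kept, one possible alternative (not part of the answer above) is to look at Unicode character names with the standard unicodedata module. The function name is_latin_text and the sample titles are assumptions for illustration. This sketch only inspects alphabetic characters, so non-Latin symbols or punctuation would still pass.

```python
import unicodedata


def is_latin_text(text: str) -> bool:
    """Return True if every alphabetic character in text is a Latin letter."""
    for ch in text:
        if ch.isalpha():
            # Unicode names of Latin letters start with "LATIN",
            # e.g. "LATIN SMALL LETTER E WITH ACUTE".
            if not unicodedata.name(ch, '').startswith('LATIN'):
                return False
    return True


# Hypothetical sample titles
titles = ['I am legend', 'Amélie', 'アライヴ', 'Сталкер', '34?#5']
checks = {t: is_latin_text(t) for t in titles}
print(checks)
```

On a DataFrame, the same function could be applied with something like df[df['title'].map(is_latin_text)], assuming the column holds only strings.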

CodePudding user response：

We can write a small function that returns whether a string is ASCII, and then filter the DataFrame based on its result.

import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    # str.isascii() already returns a bool, so return it directly
    return string.isascii()

df = pd.DataFrame(dict_1)
# Flag each title, then keep only the rows flagged as ASCII
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng']]
df2

Output:

   col1          col2  is_eng
0     1   I am legend    True
1     2  wonder women    True
4     5      dead sea    True