I have some sentences like the following:
w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"
and I am using the following regular expression to capture the non-Latin words (蔣緯國, 蒋纬国, Надежда Никитична Михалкова):
import re
for match in re.finditer(r'(?<=:\s)\W+(?=;)', w):
    print(match[0])
So I am trying to capture any run of non-word characters (\W) between the symbol : and the symbol ;, but it doesn't seem to be working. I also tried substituting \W with [^a-zA-Z0-9_], but that doesn't work either. Any help on this?
CodePudding user response:
You could also use the re.ASCII flag. With it, only [a-zA-Z0-9_] count as word characters, so \W matches the non-Latin text:
import re
w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"
print(re.findall(r'(?<=:\s)\W+(?=;)', w, re.ASCII))
Output:
['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']
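To see why the flag matters, here is a minimal sketch that runs the same pattern with and without re.ASCII on the sentence from the question. The variable names (pattern, unicode_matches, ascii_matches) are only illustrative.

```python
import re

w = ("Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; "
     "pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova "
     "(Russian: Надежда Никитична Михалкова;")

pattern = r'(?<=:\s)\W+(?=;)'

# Default (Unicode) mode: CJK and Cyrillic letters count as word characters,
# so \W+ cannot start on them and nothing is found.
unicode_matches = re.findall(pattern, w)

# ASCII mode: only [a-zA-Z0-9_] are word characters, so \W+ matches the
# non-Latin runs (including the spaces inside the Russian name).
ascii_matches = re.findall(pattern, w, re.ASCII)

print(unicode_matches)
print(ascii_matches)
```

The pinyin part is skipped in both modes because it starts with the ASCII letter J right after the colon and space.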
CodePudding user response:
You may use:
>>> re.findall(r'(?<=:\s)[^\s\dA-Za-z_][^;]*(?=;)', w)
['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']
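By default in Python 3, \W is Unicode-aware: it matches only characters that are not word characters in Unicode. CJK and Cyrillic letters are word characters, so \W never matches them. The pattern above instead requires the first character after the colon and space to be anything other than whitespace, a digit, an ASCII letter or an underscore, and then [^;]* consumes everything up to the next ;.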
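The two answers are not fully equivalent. As a rough illustration, the sketch below uses a made-up sentence (not from the question) in which the Russian name contains a Latin nickname. The names s, ascii_w and char_class are hypothetical.

```python
import re

# Hypothetical input: the Cyrillic name contains a Latin nickname in parentheses.
s = "Nadezhda Mikhalkova (Russian: Надежда (Nadya) Михалкова; actress)"

# \W+ with re.ASCII stops at the first ASCII letter ("N" in "Nadya"),
# so it never reaches the ";" and the lookahead fails.
ascii_w = re.findall(r'(?<=:\s)\W+(?=;)', s, re.ASCII)

# The character-class pattern only constrains the first character,
# then takes everything up to the next ";".
char_class = re.findall(r'(?<=:\s)[^\s\dA-Za-z_][^;]*(?=;)', s)

print(ascii_w)
print(char_class)
```

Which behaviour is preferable depends on the data: the re.ASCII version rejects any span containing ASCII letters or digits, while the character-class version accepts them as long as the span starts with a non-ASCII character.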
