current code:
import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
print(re.findall(r"\w+|\W+", txt))
output:
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT', "'", 'S', ' ', 'WATCH', '.']
desired output:
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
CodePudding user response:
Try this:
import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
# Treat the apostrophe as part of a word; everything else is grouped by \W+.
print(re.findall(r"[\w']+|\W+", txt))
CodePudding user response:
You need a character set, written with square brackets [ ]. A character set matches exactly one character out of those listed inside it.
Since you want runs made up of either word characters or ', use:
[\w']+|\W+
[ ]: a character set; matches any one of the characters listed inside.
\w: a word character (the same as [a-zA-Z0-9_] for ASCII text).
': the apostrophe itself; there is no need to escape it inside the set.
+: one or more repetitions of the preceding item.
print(re.findall(r"[\w']+|\W+", txt))
# ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
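To make the effect of the character set concrete, here is a minimal sketch that runs both patterns on the string from the question and compares the tokens around the apostrophe. The helper name `tokenize` is hypothetical and only wraps `re.findall`.

```python
import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."

def tokenize(text, pattern):
    # Hypothetical helper: return every non-overlapping match of pattern in text.
    return re.findall(pattern, text)

# Without the character set, the apostrophe is a non-word character and splits the word.
plain = tokenize(txt, r"\w+|\W+")
# With the apostrophe inside the set, it is kept as part of the word.
with_apostrophe = tokenize(txt, r"[\w']+|\W+")

print(plain[12:15])         # ['NIGHT', "'", 'S']
print(with_apostrophe[12])  # NIGHT'S

# Every character still lands in exactly one token, so joining restores the input.
print("".join(with_apostrophe) == txt)  # True
```

Because every character is either in [\w'] or matched by \W, nothing is dropped from the input with this pattern.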
CodePudding user response:
You just need to explore regex a bit more:
>>> print(re.findall(r"[a-zA-Z\']+", txt))
['Jeor', 'MORMONT', 'Lord', 'COMMANDER', 'of', 'the', "NIGHT'S", 'WATCH']
>>>
Update:
>>> import re
>>>
>>> txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
>>>
>>> required = ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT\'S', ' ', 'WATCH', '.']
>>>
>>> bag = re.findall(r'[a-zA-Z\']+|[\ ,]+|[\.]', txt)
>>>
>>> print(bag)
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
>>> print(bag == required)
True
>>>
Comment here if I missed something.
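One caveat with listing the separators explicitly: any character that is in none of the sets is silently skipped by `re.findall`. The sketch below illustrates this on a made-up sentence (`sample` is not from the question) and compares it with a variant that uses a negated set, `[^\w']+`, as the second alternative. That variant is an assumption about what you might want for general text, not part of the original requirement.

```python
import re

# Made-up input containing punctuation that the explicit sets do not list.
sample = "Who goes there? The NIGHT'S WATCH!"

# Explicit sets: letters/apostrophes, spaces/commas, or a single period.
listed = re.findall(r"[a-zA-Z']+|[ ,]+|[.]", sample)
# General variant: a run of word characters/apostrophes, or a run of anything else.
general = re.findall(r"[\w']+|[^\w']+", sample)

# '?' and '!' match none of the explicit sets, so they are lost.
print("".join(listed) == sample)   # False
# The two alternatives of the general pattern cover every character.
print("".join(general) == sample)  # True
print(general[5])                  # '? ' as a single separator token
```

If the input is known to contain only letters, spaces, commas and periods, the explicit version is fine; otherwise the negated set is the safer default.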
