I am preprocessing text for spaCy and want to remove punctuation, except when it sits between digits. However, in some cases punctuation attached to a digit on only one side is not removed. Could you please suggest how to deal with this edge case?
An example:
import re
text = "Fast-charge $ EV ! battery maker StoreDot pulls in $80.7M led led by Vietnam’s VinFast"
preprocessed = re.sub(r'(?<!\d)[%$!.,,;:’“”—-](?!\d)',' ', text)
print(preprocessed)
# Fast charge EV battery maker StoreDot pulls in $80.7M led by Vietnam s VinFast
Expected result:
# Fast charge EV battery maker StoreDot pulls in 80.7M led by Vietnam s VinFast
CodePudding user response:
Both negative lookarounds must succeed at the same time for a match, so the pattern does not match the $ in $80: the lookahead (?!\d) fails because a digit follows.
Instead, you can use an alternation | that matches one character from the character class when there is no digit on the left side or no digit on the right side.
(?<!\d)[%$!.,;:’“”—-]|[%$!.,;:’“”—-](?!\d)
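As a minimal sketch of how the alternation could be applied, the snippet below builds the pattern from a single character class and runs it on the example sentence (with the duplicated "led" removed). The names PUNCT, pattern and result are illustrative and not part of the question's code.

```python
import re

text = "Fast-charge $ EV ! battery maker StoreDot pulls in $80.7M led by Vietnam’s VinFast"

# Punctuation to strip; the hyphen is placed last so it is taken literally.
PUNCT = r"[%$!.,;:’“”—-]"

# Match a punctuation character when there is no digit before it
# OR no digit after it. It survives only with digits on both sides.
pattern = re.compile(rf"(?<!\d){PUNCT}|{PUNCT}(?!\d)")

result = pattern.sub(" ", text)
print(result)
```

With this pattern the $ in front of 80.7M is replaced, while the decimal point stays because it has a digit on each side. The replacement still leaves runs of spaces behind.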
Notes
There is "led led" in the example string and a single "led" in the expected result, but I assume that is a typo, because the character class cannot match "led".
There is also a double entry for the , in the character class.
Not sure if you want to keep them, but using a space in the replacement can leave double-spaced gaps, as you can see in the expected result. If you want to remove them, you can use strip() to remove the leading and trailing spaces, and use re.sub with r"[^\S\n]{2,}" to match 2 or more whitespace characters other than newlines and replace them with a single space.
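The whitespace cleanup described in the notes could look roughly like this. The input string spaced is a made-up example of what is left after punctuation has been replaced with spaces; it is not taken from the question.

```python
import re

# Hypothetical input: text after punctuation was replaced by spaces.
spaced = " Fast charge   EV   battery maker\npulls in  80.7M "

# Collapse runs of 2+ whitespace characters (newlines excluded) into one
# space, then trim leading and trailing whitespace.
cleaned = re.sub(r"[^\S\n]{2,}", " ", spaced).strip()
print(cleaned)
```

Because the class [^\S\n] excludes the newline, line breaks in the text are preserved and only horizontal gaps are collapsed.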
