Regex remove punctuation but not between digits incorrect result

Time:01-12

I am preprocessing text for spaCy and want to remove punctuation, except when it sits between digits. However, in some cases punctuation that is directly attached to a digit is not removed. Could you please suggest how to deal with this edge case?

An example:

import re

text = "Fast-charge $ EV ! battery maker StoreDot pulls in $80.7M led led by Vietnam’s VinFast"

preprocessed = re.sub(r'(?<!\d)[%$!.,,;:’“”—-](?!\d)',' ', text)

print(preprocessed)

# Fast charge   EV   battery maker StoreDot pulls in $80.7M led by Vietnam s VinFast

Expected result:

# Fast charge   EV   battery maker StoreDot pulls in 80.7M led by Vietnam s VinFast

Answer:

Both negative lookarounds must succeed at the same position, so the pattern cannot match the $ in $80: the $ is followed by a digit, which makes (?!\d) fail.
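To see this in isolation, the original pattern (with the duplicate comma dropped) can be tried on two short strings. This is only a small sketch; the name original and the test strings are chosen here for illustration.

```python
import re

# The pattern from the question: no digit on the left AND no digit on the right.
original = r"(?<!\d)[%$!.,;:’“”—-](?!\d)"

# "$" is directly followed by a digit, so (?!\d) fails and nothing matches.
print(re.search(original, "$80"))

# Here "$" has no digit on either side, so it is matched.
print(re.search(original, "$ EV"))
```

The first call prints None, while the second prints a match object for the $.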

Instead, you can use an alternation | to match a character from the character class when there is no digit on its left or no digit on its right.

(?<!\d)[%$!.,;:’“”—-]|[%$!.,;:’“”—-](?!\d)
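Applied to the string from the question, the alternation could be used as in the following sketch. The variable name pattern is introduced here for readability; the input string is taken verbatim from the question, including the repeated led.

```python
import re

text = "Fast-charge $ EV ! battery maker StoreDot pulls in $80.7M led led by Vietnam’s VinFast"

# Match a punctuation character that has no digit on its left
# OR no digit on its right. Only punctuation with digits on
# both sides (such as the "." in 80.7) is left alone.
pattern = r"(?<!\d)[%$!.,;:’“”—-]|[%$!.,;:’“”—-](?!\d)"

preprocessed = re.sub(pattern, " ", text)
print(preprocessed)
```

The $ in front of 80.7M is now replaced, while the . between the digits is kept. Because each match is replaced with a space, the result still contains runs of several spaces.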


Notes

  • The example string contains led led, while the expected result has a single led. I assume that is a typo, because the character class cannot match led.

  • There is also a double entry for the , in the character class.

  • Replacing each match with a space can leave runs of multiple spaces, as you can see in the expected result. If you do not want to keep them, you can call strip() to remove leading and trailing whitespace, and use re.sub with r"[^\S\n]{2,}" to replace two or more whitespace characters (excluding newlines) with a single space.
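The whitespace cleanup from the last note could look like the sketch below. The string replaced is a made-up intermediate result (punctuation already replaced by spaces), with a newline added to show that line breaks survive.

```python
import re

# Hypothetical intermediate result after the punctuation replacement.
replaced = " Fast charge   EV   battery maker\nStoreDot pulls in  80.7M "

# [^\S\n] matches any whitespace character except a newline,
# so runs of two or more of them collapse to a single space.
collapsed = re.sub(r"[^\S\n]{2,}", " ", replaced).strip()
print(collapsed)
```

This prints the two lines Fast charge EV battery maker and StoreDot pulls in 80.7M, each with single spaces only.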
