How to replace a dot (.) in a sentence except when it appears in an abbreviation, using a regular expression

Time:01-05

I want to replace every dot in a sentence with a space, except when the dot is part of an abbreviation. In an abbreviation, I want to remove the dot instead (replace it with the empty string '').

An abbreviation here means a dot surrounded by at least two capital letters.

My regexes work, except that in U.S.A they catch only U.S.

r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'

'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'

should become

'USA is abbr  x y  is not But IIT is also valid ABBVR and so is MTech'

CodePudding user response:

You can use

import re
s = 'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'
print(re.sub(r'\b((?:[A-Z]\.)+)\.?|\.', lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s))
# => USA is abbr  x y  is not  But IIT is also valid ABBVR and so is MTech

The regex matches:

  • \b((?:[A-Z]\.)+)\.? - a word boundary, then Group 1 capturing one or more occurrences of an uppercase ASCII letter followed by a ., and then an optional dot (if an abbreviation ends with a dot)
  • | - or
  • \. - a dot (in any other context)

If Group 1 matches, the replacement is the Group 1 value with all dots removed via .replace('.', ''); otherwise, the replacement is a space.
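The replacement logic described above can also be written as a named callback instead of a lambda, which some readers may find easier to follow. This is only a sketch: the names ABBR_OR_DOT, _replace and replace_dots are made up for illustration and are not part of the re module.

```python
import re

# Same pattern as described above, compiled once for reuse.
# ABBR_OR_DOT, _replace and replace_dots are hypothetical names.
ABBR_OR_DOT = re.compile(r'\b((?:[A-Z]\.)+)\.?|\.')

def _replace(match):
    abbr = match.group(1)
    if abbr:
        # Abbreviation branch: keep the letters, drop the dots.
        return abbr.replace('.', '')
    # Any other dot becomes a single space.
    return ' '

def replace_dots(text):
    return ABBR_OR_DOT.sub(_replace, text)

print(replace_dots('U.S.A is abbr  x.y  is not. But I.I.T. is valid'))
# => USA is abbr  x y  is not  But IIT is valid
```

Note that a sentence-final dot followed by a space yields two consecutive spaces ("not  But"), because the dot itself is turned into a space.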

To make it Unicode-aware, install the PyPI regex library (pip install regex) and use

import regex
s = 'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'
print(regex.sub(r'\b((?:\p{Lu}\.)+)\.?|\.', lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s))

The \p{Lu} matches any Unicode uppercase letter.
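To see why Unicode awareness can matter, here is a small sketch using only the standard re module with the ASCII class [A-Z]. The input string and the helper name replace_dots_ascii are made up for illustration.

```python
import re

# ASCII-only pattern: [A-Z] does not match accented capitals such as É.
ASCII_ABBR_OR_DOT = re.compile(r'\b((?:[A-Z]\.)+)\.?|\.')

def replace_dots_ascii(text):
    # Hypothetical helper: abbreviation dots are removed, other dots become spaces.
    return ASCII_ABBR_OR_DOT.sub(
        lambda m: m.group(1).replace('.', '') if m.group(1) else ' ', text)

# '\u00c9' is É (LATIN CAPITAL LETTER E WITH ACUTE).
print(replace_dots_ascii('\u00c9.U. and U.S.'))
# => É U and US
```

The dot after É is treated as an ordinary dot and becomes a space, whereas a \p{Lu}-based pattern would be expected to treat É.U. as a single abbreviation.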
