I would like to remove dots from abbreviations in a Pandas dataframe but not if the dots are in between longer words. So 'l.t.d.' and 'ltd.' should result in 'ltd' but 'longword.' should remain the same.
The regex I have now is (?:\b\w{1,3})(\.). I want to replace only what group 1 matches (the dot) with an empty string. How can I tell str.replace(r'(?:\b\w{1,3})(\.)', '') to replace only that group?
Answer:
You can use
df['col'] = df['col'].str.replace(r'\b([a-zA-Z]{1,3})\.', r'\1', regex=True)
# Or, to account for any Unicode letters:
df['col'] = df['col'].str.replace(r'\b([^\W\d_]{1,3})\.', r'\1', regex=True)
See the regex demo. Details:
\b - word boundary
([^\W\d_]{1,3}) - Group 1 (\1): one, two or three letters
\. - a dot.
The \1 in the replacement refers to the Group 1 value.
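To see the capture-and-restore idea in isolation, here is a minimal sketch using only the standard re module on the three strings from the question. The helper name strip_abbrev_dots and the extra 'äb.' sample are made up for illustration and are not part of the original answer.

```python
import re

# ASCII variant from the answer: word boundary, 1-3 letters (Group 1), then a dot.
ASCII_ABBREV = re.compile(r'\b([a-zA-Z]{1,3})\.')
# Unicode variant from the answer: [^\W\d_] matches any letter, not only a-z.
UNICODE_ABBREV = re.compile(r'\b([^\W\d_]{1,3})\.')

def strip_abbrev_dots(text, pattern=ASCII_ABBREV):
    # Hypothetical helper: the whole match (letters + dot) is replaced
    # by Group 1 alone, so only the dot disappears.
    return pattern.sub(r'\1', text)

samples = ['l.t.d.', 'ltd.', 'longword.']
results = [strip_abbrev_dots(s) for s in samples]
print(results)  # ['ltd', 'ltd', 'longword.']

# A non-ASCII abbreviation is only handled by the Unicode variant.
ascii_result = strip_abbrev_dots('äb.')
unicode_result = strip_abbrev_dots('äb.', UNICODE_ABBREV)
print(ascii_result, unicode_result)  # äb. äb
```

The point is that the replacement never targets "only the group". Instead, the part to keep is captured and written back, and everything else in the match is dropped.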
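Note that you should pass regex=True to Series.str.replace explicitly. Older pandas versions emit the warning described in "FutureWarning: The default value of regex will change from True to False in a future version", and since pandas 2.0 the default is regex=False, so without the argument the pattern is treated as a literal string.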
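As a rough illustration of why the regex argument matters, the sketch below runs the same call with regex=True and regex=False on a small DataFrame. The sample data and variable names are assumptions for the demo, not taken from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the strings in the question.
df = pd.DataFrame({'col': ['l.t.d.', 'ltd.', 'longword.']})
pattern = r'\b([a-zA-Z]{1,3})\.'

# Pattern interpreted as a regular expression: dots after short words are removed.
as_regex = df['col'].str.replace(pattern, r'\1', regex=True)

# Pattern interpreted as a literal substring: nothing matches, so nothing changes.
as_literal = df['col'].str.replace(pattern, r'\1', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Passing the argument explicitly keeps the behaviour the same regardless of which pandas version is installed.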
