I want to remove periods from degrees like B.S. or Ph.D as well as titles like Mr. or Mrs., so they all have the same format. However, I want to keep the remaining punctuation as is. How can I achieve this without removing all punctuation? Here is an example of what I mean:
Original:
"Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology.
She also serves on the editorial board for the Journal of Managerial Psychology"
Processed:
"Dr Samuels received her BS and PhD in Industrial and Organizational Psychology.
She also serves on the editorial board for the Journal of Managerial Psychology"
CodePudding user response:
You need to do a regex replacement for the specific instances that require altered punctuation.
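import re

text = ("Dr. Samuels received her B.S. and Ph.D in Industrial and "
        "Organizational Psychology. She also serves on the editorial board "
        "for the Journal of Managerial Psychology")

# Map each abbreviation to its period-free form;
# add further entries such as 'Dr.': 'Dr' as needed.
replacements = {
    'B.S.': 'BS',
    'Ph.D': 'PhD',
}

def remove_punctuation(mapping, text):
    # re.escape makes each "." match a literal period rather than any character.
    pattern = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(remove_punctuation(replacements, text))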
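If the mapping grows to cover titles and variants such as "Ph.D." with a trailing period, overlapping keys become a concern. The following is a minimal sketch of one way to handle that, not a definitive implementation: the `ABBREVIATIONS` table and the `normalize_abbreviations` name are hypothetical, and the title entries are assumptions added for illustration. Keys are sorted longest first so "Ph.D." is tried before "Ph.D", and a lookbehind keeps a key from matching in the middle of a longer word.

```python
import re

# Hypothetical mapping; the title entries are assumptions for illustration.
ABBREVIATIONS = {
    "Dr.": "Dr",
    "Mr.": "Mr",
    "Mrs.": "Mrs",
    "B.S.": "BS",
    "Ph.D.": "PhD",
    "Ph.D": "PhD",
}

def normalize_abbreviations(text, mapping=ABBREVIATIONS):
    # Longest keys first, so "Ph.D." wins over its prefix "Ph.D".
    keys = sorted(mapping, key=len, reverse=True)
    # (?<!\w) stops a key from matching when it is preceded by a word character.
    pattern = re.compile(r"(?<!\w)(?:%s)" % "|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

sample = "Dr. Samuels and Mrs. Lee hold a B.S. and a Ph.D. in Psychology."
print(normalize_abbreviations(sample))
# Dr Samuels and Mrs Lee hold a BS and a PhD in Psychology.
```

Note that an abbreviation such as "Ph.D." at the very end of a sentence would also lose the sentence-ending period with this approach.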
CodePudding user response:
For a general solution you will need a system that can deal with natural language, and I highly recommend using an existing library for that.
With the sentence tokenizer of nltk:
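# The tokenizers need NLTK's Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from string import punctuation

def remove_dots(word):
    # Only touch tokens that start with a letter and continue with letters or
    # periods; a standalone "." token (the sentence end) is left alone.
    if re.match(r'^[A-Za-z][A-Za-z.]+', word):
        return word.replace('.', '')
    return word

text = 'Your input here.'
# Rejoin the tokens, putting a space before everything except punctuation.
output = ''.join(('' if word in punctuation else ' ') + remove_dots(word)
                 for sentence in sent_tokenize(text)
                 for word in word_tokenize(sentence)).lstrip()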
Depending on your data, this might not catch every instance you want, because NLP is complex and problems like this require fine-tuning. However, it should give you a good idea of how to start.
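One simple form of fine-tuning is an allow-list of abbreviations whose periods should be kept. The sketch below is only an illustration under stated assumptions: `KEEP_AS_IS` and `remove_dots_except` are hypothetical names, the entries in the set are made up, and the token list is written by hand to stand in for tokenizer output so that the example runs without NLTK.

```python
import re

# Hypothetical allow-list of abbreviations that should keep their periods.
KEEP_AS_IS = {"U.S.", "e.g.", "i.e."}

def remove_dots_except(word, keep=KEEP_AS_IS):
    # Leave allow-listed tokens untouched.
    if word in keep:
        return word
    # Otherwise strip periods from tokens made of letters and periods.
    if re.match(r'^[A-Za-z][A-Za-z.]+', word):
        return word.replace('.', '')
    return word

# Hand-written tokens, assumed to resemble what a word tokenizer would produce.
tokens = ['Dr.', 'Lee', 'moved', 'to', 'the', 'U.S.', 'after', 'her', 'Ph.D', '.']
print([remove_dots_except(t) for t in tokens])
# ['Dr', 'Lee', 'moved', 'to', 'the', 'U.S.', 'after', 'her', 'PhD', '.']
```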
CodePudding user response:
text = "Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology. She also serves on the editorial board for the Journal of Managerial Psychology"
words = text.split()
upd_text = ""
for k, word in enumerate(words):
    for i, sym in enumerate(word):
        if sym != '.':
            upd_text += sym
        # Keep a trailing period only when it likely ends a sentence: the word
        # is longer than 3 characters and is either the last word or is
        # followed by a capitalized word.
        elif i == len(word) - 1 and len(word) > 3:
            is_last = k == len(words) - 1
            if is_last or words[k + 1][0].isupper():
                upd_text += '.'
    upd_text += " "
upd_text = upd_text.rstrip()
print(upd_text)
