How to strip periods from certain words but not the ends of sentences?

I want to remove periods from degrees like B.S. or Ph.D as well as titles like Mr. or Mrs., so they all have the same format. However, I want to keep the remaining punctuation as is. How can I achieve this without removing all punctuation? Here is an example of what I mean:

Original:
"Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology. 
 She also serves on the editorial board for the Journal of Managerial Psychology"

Processed:
"Dr Samuels received her BS and PhD in Industrial and Organizational Psychology. 
 She also serves on the editorial board for the Journal of Managerial Psychology"

CodePudding user response：

You need to do a regex replacement for the specific instances that require altered punctuation.

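import re

string = "Dr. Samuels received her B.S. and Ph.D in Industrial and " \
      "Organizational Psychology. She also serves on the editorial board for the Journal of Managerial Psychology"

# Map each abbreviation to its period-free form; anything not listed is left alone.
replacements = {
    'Dr.': 'Dr',
    'Mr.': 'Mr',
    'Mrs.': 'Mrs',
    'B.S.': 'BS',
    'Ph.D': 'PhD'
}

def remove_punctuation(mapping, text):
    # Build one alternation from the escaped keys, then swap each match for its value.
    pattern = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(remove_punctuation(replacements, string))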
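
If the list grows, two details start to matter: an entry that is a prefix of another (Ph.D and Ph.D.) must be tried after the longer one, and a match should not begin in the middle of a longer word. Here is a minimal sketch of one way to handle both. The names ABBREVIATIONS, build_pattern and strip_abbreviation_periods are made up for illustration, and the list itself is only an assumption about your data.

```python
import re

# Hypothetical abbreviation list; extend it to suit your data.
ABBREVIATIONS = ["Dr.", "Mr.", "Mrs.", "B.S.", "Ph.D.", "Ph.D"]

def build_pattern(abbreviations):
    # Longest entries first, so "Ph.D." is tried before its prefix "Ph.D".
    ordered = sorted(abbreviations, key=len, reverse=True)
    alternation = "|".join(map(re.escape, ordered))
    # (?<!\w) rejects matches that would start inside a longer word.
    return re.compile(r"(?<!\w)(?:%s)" % alternation)

def strip_abbreviation_periods(text, abbreviations=ABBREVIATIONS):
    pattern = build_pattern(abbreviations)
    # Drop the periods from each matched abbreviation only.
    return pattern.sub(lambda m: m.group(0).replace(".", ""), text)

sample = "Dr. Samuels received her B.S. and Ph.D in Psychology. Mrs. Lee holds a Ph.D. too."
print(strip_abbreviation_periods(sample))
```

Note that an abbreviation sitting at the very end of a sentence (for example "... holds a Ph.D.") would lose the sentence's period as well, so this still needs tuning for real text.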

CodePudding user response：

For a general solution you need a tool that can handle natural language, and I highly recommend using an existing library rather than writing your own.

With the sentence tokenizer of nltk:

from nltk.tokenize import sent_tokenize, word_tokenize
import re
from string import punctuation

def remove_dots(word): 
    if re.match(r'^[A-Za-z][A-Za-z.]+', word):
        return word.replace('.', '') 
    return word

text = 'Your input here.'

output = ''.join(('' if word in punctuation else ' ') + remove_dots(word)
                 for sentence in sent_tokenize(text)
                 for word in word_tokenize(sentence)).lstrip()

Depending on your data this might not catch every instance you want because NLP is a complex matter and such problems require fine-tuning. However, this should give you a good idea of how to start.
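
If installing NLTK is not an option, a rough standard-library heuristic can serve as a starting point for that fine-tuning. This is only a sketch, not a replacement for a real tokenizer: it treats a token as an abbreviation when it has a period directly between two letters (B.S., Ph.D) or when it appears in a small set of titles. The TITLES set and the function names are assumptions made up for this example.

```python
import re

# Assumed set of titles; they have no internal period, so they must be listed.
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

# Matches a period sitting directly between two letters, as in "B.S." or "Ph.D".
INTERNAL_DOT = re.compile(r"[A-Za-z]\.[A-Za-z]")

def strip_token(token):
    # Remove every period from tokens that look like abbreviations.
    if token in TITLES or INTERNAL_DOT.search(token):
        return token.replace(".", "")
    return token

def strip_abbreviations(text):
    # Splitting on whitespace and re-joining collapses runs of whitespace.
    return " ".join(strip_token(tok) for tok in text.split())

print(strip_abbreviations("Dr. Samuels received her B.S. and Ph.D in Psychology. She teaches."))
```

The obvious weak spots are tokens such as "e.g." or "example.com", which also have a period between letters and would be stripped too.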

CodePudding user response：

text = "Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology. She also serves on the editorial board for the Journal of Managerial Psychology"
words = text.split()
upd_text = ""
for k, word in enumerate(words):
    for i, sym in enumerate(word):
        if sym != '.':
            upd_text += sym
        # Keep a trailing period only when the word is longer than three
        # characters and either ends the text or is followed by a
        # capitalised word, i.e. when it most likely ends a sentence.
        elif i == len(word) - 1 and len(word) > 3:
            if k + 1 == len(words) or words[k + 1][0].isupper():
                upd_text += '.'
    upd_text += " "
upd_text = upd_text.rstrip()
print(upd_text)
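
The same word-length heuristic can be wrapped in a function, which makes it easier to see where it breaks down. This is a sketch under stated assumptions: the name strip_periods is made up, and treating the last word of the text as a sentence end is my own choice.

```python
def strip_periods(text):
    words = text.split()
    out = []
    for k, word in enumerate(words):
        body = word.replace(".", "")
        is_last = k + 1 == len(words)
        # A trailing period is kept only when the word is longer than three
        # characters and is followed by a capitalised word (or ends the text).
        ends_sentence = (
            word.endswith(".")
            and len(word) > 3
            and (is_last or words[k + 1][0].isupper())
        )
        out.append(body + "." if ends_sentence else body)
    return " ".join(out)

print(strip_periods("She studied Psychology. He did not."))
print(strip_periods("Mrs. Lee holds a B.S. in Biology."))
```

The second call shows the limitation: "Mrs." is four characters long and is followed by a capitalised name, so its period survives. Titles like that would need an explicit exception list.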