How to strip periods from certain words but not the end of sentences?

Time:01-16

I want to remove periods from degrees like B.S. or Ph.D as well as titles like Mr. or Mrs., so they all have the same format. However, I want to keep the remaining punctuation as is. How can I achieve this without removing all punctuation? Here is an example of what I mean:

Original:
"Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology. 
 She also serves on the editorial board for the Journal of Managerial Psychology"

Processed:
"Dr Samuels received her BS and PhD in Industrial and Organizational Psychology. 
 She also serves on the editorial board for the Journal of Managerial Psychology"

CodePudding user response:

You need to do a regex replacement for the specific instances that require altered punctuation.

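import re

text = ("Dr. Samuels received her B.S. and Ph.D in Industrial and "
        "Organizational Psychology. She also serves on the editorial board "
        "for the Journal of Managerial Psychology")

# Map each abbreviation to its period-free form; add more entries as needed.
replacements = {
    'Dr.': 'Dr',
    'B.S.': 'BS',
    'Ph.D': 'PhD'
}

def remove_punctuation(mapping, text):
  # Join the escaped keys into one alternation so "." is matched literally.
  pattern = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
  return pattern.sub(lambda m: mapping[m.group(0)], text)

print(remove_punctuation(replacements, text))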
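
If maintaining a replacement value for every key becomes tedious, the same idea can be sketched with a plain list of abbreviations whose periods are stripped on match. The `ABBREVIATIONS` list and the function name below are assumptions for illustration, not part of the question:

```python
import re

# Hypothetical list of abbreviations to normalize; extend it for your data.
ABBREVIATIONS = ["Dr.", "Mr.", "Mrs.", "Ms.", "B.S.", "M.S.", "Ph.D.", "Ph.D"]

def strip_abbreviation_periods(text, abbreviations=ABBREVIATIONS):
    # Try longer entries first so "Ph.D." is preferred over "Ph.D".
    ordered = sorted(abbreviations, key=len, reverse=True)
    # (?<!\w) stops a match from starting in the middle of a longer word.
    pattern = re.compile(r"(?<!\w)(?:%s)" % "|".join(map(re.escape, ordered)))
    # Replace each matched abbreviation with itself minus the periods.
    return pattern.sub(lambda m: m.group(0).replace(".", ""), text)

print(strip_abbreviation_periods(
    "Dr. Samuels received her B.S. and Ph.D in Psychology."))
```

One caveat: an abbreviation that also ends a sentence (for example "She has a Ph.D. She teaches.") loses its period too, so that case would need extra handling.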

CodePudding user response:

For a general solution you will need a tool that can handle natural language, and I highly recommend using an existing library for that.

With the sentence tokenizer of nltk:

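from nltk.tokenize import sent_tokenize, word_tokenize
import re
from string import punctuation

# Requires NLTK's Punkt tokenizer data to be downloaded first (see nltk.download).

def remove_dots(word):
    # Strip periods from tokens that start with a letter followed by
    # letters or periods, e.g. "B.S." or "Dr."
    if re.match(r'^[A-Za-z][A-Za-z.]+', word):
        return word.replace('.', '')
    return word

text = 'Your input here.'

# Re-join the tokens, putting a space before each one except standalone punctuation.
output = ''.join(('' if word in punctuation else ' ') + remove_dots(word)
                 for sentence in sent_tokenize(text)
                 for word in word_tokenize(sentence)).lstrip()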

Depending on your data this might not catch every instance you want because NLP is a complex matter and such problems require fine-tuning. However, this should give you a good idea of how to start.
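
One possible direction for that fine-tuning is to narrow `remove_dots` so it only touches tokens that look like abbreviations, leaving things like domain names alone. This is only a sketch: the `TITLES` set and the `ABBREVIATION` pattern are assumptions about what counts as an abbreviation, not something taken from nltk.

```python
import re

# Assumed set of titles to normalize; extend it for your data.
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

# Assumed abbreviation shape: groups of one or two letters, each followed by
# a period, optionally ending in up to two letters ("B.S.", "Ph.D", "U.S.A.").
ABBREVIATION = re.compile(r"^(?:[A-Za-z]{1,2}\.)+[A-Za-z]{0,2}$")

def remove_dots(word):
    # Only strip periods from known titles or abbreviation-shaped tokens.
    if word in TITLES or ABBREVIATION.match(word):
        return word.replace(".", "")
    return word

for token in ["Dr.", "B.S.", "Ph.D", "example.com"]:
    print(token, "->", remove_dots(token))
```

Note that this pattern also rewrites tokens such as "e.g." to "eg", which may or may not be what you want.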

CodePudding user response:

text = "Dr. Samuels received her B.S. and Ph.D in Industrial and Organizational Psychology. She also serves on the editorial board for the Journal of Managerial Psychology"
textstrp = text.split()
upd_text = ""
for k, word in enumerate(textstrp):
    for i, sym in enumerate(word):
        if sym != '.':
            upd_text += sym
        # Keep a trailing period only on words longer than 3 characters
        # that are followed by a capitalized word (a likely sentence end).
        if sym == '.' and i == len(word) - 1 and len(word) > 3 and k + 1 < len(textstrp):
            if textstrp[k + 1][0].isupper():
                upd_text += '.'
        # Add a space after the last character of each word.
        if i == len(word) - 1:
            upd_text += " "
upd_text = upd_text[:-1]  # drop the trailing space
upd_text += '.'           # end the text with a period
print(upd_text)