How to Capitalize Locations in a List Python-CodePudding

I am using NLTK lib in python to break down each word into tagged elements (i.e. ('London', ''NNP)). However, I cannot figure out how to take this list, and capitalise locations if they are lower case. This is important because london is no longer an 'NNP' and some other locations even become verbs. If anyone knows how to do this efficiently, that would be amazing!

Here is my code:

# returns nature of question with appropriate response text
def chunk_target(self, text, extract_targets):
    
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammer in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []

            # This is where i'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()

            print(new)              
            print(tagged)

            chunkGram = chunk_grammer
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
        if stack != []:
            return stack[0]
    return None

CodePudding user response：

What you're looking for is Named Entity Recognition (NER). NLTK does support a named entity function: ne_chunk, which can be used for this purpose. I'll give a demonstration:

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()

locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)

print(locations)

This outputs (with my comments added):

# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
  In/IN
  the/DT
  wake/NN
  of/IN
  a/DT
  string/NN
  of/IN
  abuses/NNS
  by/IN
  (GPE New/NNP York/NNP)
  police/NN
  officers/NNS
  in/IN
  the/DT
  1990s/CD
  ,/,
  (PERSON Loretta/NNP E./NNP Lynch/NNP)
  ,/,
  the/DT
  top/JJ
  federal/JJ
  prosecutor/NN
  in/IN
  (GPE Brooklyn/NNP)
  ,/,
  spoke/VBD
  forcefully/RB
  about/IN
  the/DT
  pain/NN
  of/IN
  a/DT
  broken/JJ
  trust/NN
  that/IN
  African-Americans/NNP
  felt/VBD
  and/CC
  said/VBD
  the/DT
  responsibility/NN
  for/IN
  repairing/VBG
  generations/NNS
  of/IN
  miscommunication/NN
  and/CC
  mistrust/NN
  fell/VBD
  to/TO
  law/NN
  enforcement/NN
  ./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']

However, it should be noted that the performance of this ne_chunk seems to fall significantly if we remove all capitalisation from the sentence.

We can perform similar stuff with spaCy:

import spacy
import en_core_web_sm
from pprint import pprint

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()

doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])

Which outputs:

[('New York', 'GPE'),
 ('the 1990s', 'DATE'),
 ('Loretta E. Lynch', 'PERSON'),
 ('Brooklyn', 'GPE'),
 ('African-Americans', 'NORP')]
['New York', 'Brooklyn']

This output (for GPE's) is identical to NLTK's, but the reason I mention spaCy is because unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, then the output becomes:

[('new york', 'GPE'),
 ('the 1990s', 'DATE'),
 ('loretta e. lynch', 'PERSON'),
 ('brooklyn', 'GPE'),
 ('african-americans', 'NORP')]
['new york', 'brooklyn']

This allows you to title-case these words in an otherwise lower-case sentence.