I am using the NLTK library in Python to break text down into tagged tokens (e.g. ('London', 'NNP')). However, I cannot figure out how to take this list and capitalise locations that are written in lower case. This matters because lower-case london is no longer tagged 'NNP', and some other locations are even tagged as verbs. If anyone knows how to do this efficiently, that would be amazing!
Here is my code:
# returns nature of question with appropriate response text
def chunk_target(self, text, extract_targets):
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammer in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []
            # This is where i'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()
            print(new)
            print(tagged)
            chunkGram = chunk_grammer
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
    if stack != []:
        return stack[0]
    return None
CodePudding user response:
What you're looking for is Named Entity Recognition (NER). NLTK provides a named-entity chunker, ne_chunk, which can be used for this purpose. Here is a demonstration:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()
locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)
print(locations)
This outputs (with my comments added):
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
In/IN
the/DT
wake/NN
of/IN
a/DT
string/NN
of/IN
abuses/NNS
by/IN
(GPE New/NNP York/NNP)
police/NN
officers/NNS
in/IN
the/DT
1990s/CD
,/,
(PERSON Loretta/NNP E./NNP Lynch/NNP)
,/,
the/DT
top/JJ
federal/JJ
prosecutor/NN
in/IN
(GPE Brooklyn/NNP)
,/,
spoke/VBD
forcefully/RB
about/IN
the/DT
pain/NN
of/IN
a/DT
broken/JJ
trust/NN
that/IN
African-Americans/NNP
felt/VBD
and/CC
said/VBD
the/DT
responsibility/NN
for/IN
repairing/VBG
generations/NNS
of/IN
miscommunication/NN
and/CC
mistrust/NN
fell/VBD
to/TO
law/NN
enforcement/NN
./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
However, note that the accuracy of ne_chunk seems to drop significantly if we remove all capitalisation from the sentence.
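If you want to stay with NLTK and already have a list of known place names (like the stations list in the question), one possible workaround is to capitalise matching tokens before they are passed to pos_tag. The following is only a minimal sketch: the function name capitalise_known_places and the example place list are hypothetical, and it matches single words, so a common word such as "new" gets capitalised wherever it appears.

```python
def capitalise_known_places(tokens, known_places):
    """Return a copy of tokens with words from known place names capitalised.

    known_places is a hypothetical iterable of place names (e.g. station names).
    """
    # Collect every individual word of every place name, lower-cased for matching.
    place_words = {word.lower() for name in known_places for word in name.split()}
    # str.capitalize() upper-cases the first letter and lower-cases the rest.
    return [tok.capitalize() if tok.lower() in place_words else tok for tok in tokens]


tokens = "when is the next train from london to new york".split()
fixed = capitalise_known_places(tokens, ["London", "New York", "Kings Cross"])
print(fixed)
# ['when', 'is', 'the', 'next', 'train', 'from', 'London', 'to', 'New', 'York']
```

The resulting list can then be handed to pos_tag in place of the original tokens. Note that this builds a new list rather than reassigning the loop variable, which is why the assignment inside the loop in the question has no effect on tagged.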
We can do something similar with spaCy:
import spacy
import en_core_web_sm
from pprint import pprint
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
Which outputs:
[('New York', 'GPE'),
('the 1990s', 'DATE'),
('Loretta E. Lynch', 'PERSON'),
('Brooklyn', 'GPE'),
('African-Americans', 'NORP')]
['New York', 'Brooklyn']
The GPE output is identical to NLTK's, but I mention spaCy because, unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, the output becomes:
[('new york', 'GPE'),
('the 1990s', 'DATE'),
('loretta e. lynch', 'PERSON'),
('brooklyn', 'GPE'),
('african-americans', 'NORP')]
['new york', 'brooklyn']
This allows you to title-case these words in an otherwise lower-case sentence.
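As a rough sketch of that last step, the helper below title-cases only the character ranges that were recognised as entities. The function title_case_spans and the find_span helper are hypothetical names, and the spans are hard-coded here so the example runs without a model; with spaCy they would come from the start_char and end_char attributes of each entity. It assumes the spans do not overlap.

```python
def title_case_spans(text, spans):
    """Title-case the parts of text covered by each (start, end) character span.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])       # unchanged text before the span
        pieces.append(text[start:end].title())  # the entity itself, title-cased
        cursor = end
    pieces.append(text[cursor:])                # whatever follows the last span
    return "".join(pieces)


def find_span(text, phrase):
    """Hypothetical stand-in for an entity: locate phrase and return its offsets."""
    start = text.index(phrase)
    return (start, start + len(phrase))


sentence = "the top federal prosecutor in brooklyn moved from new york"
# With spaCy, the spans could instead be built as:
# spans = [(ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ == "GPE"]
spans = [find_span(sentence, "brooklyn"), find_span(sentence, "new york")]
result = title_case_spans(sentence, spans)
print(result)
# the top federal prosecutor in Brooklyn moved from New York
```

Working with character offsets rather than replacing strings means that only the recognised occurrence is changed, and the rest of the sentence keeps its original spacing.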
