I build a DataFrame from a spaCy doc created from a piece of text, as follows:
test='We walked the walk and still walk it today. Walking brings us great joy.'
tokens=[]
lemma=[]
pos=[]
df=pd.DataFrame()
doc=nlp(test)
for t in doc:
    tokens.append(t.text)
    lemma.append(t.lemma_)
    pos.append(t.pos_)
df['tokens']=tokens
df['lemma']=lemma
df['pos']=pos
df
     tokens   lemma    pos
0        We  -PRON-   PRON
1    walked    walk   VERB
2       the     the    DET
3      walk    walk   NOUN
4       and     and  CCONJ
5     still   still    ADV
6      walk    walk   VERB
7        it  -PRON-   PRON
8     today   today   NOUN
9         .       .  PUNCT
10  Walking    walk   VERB
11   brings   bring   VERB
12       us  -PRON-   PRON
13    great   great    ADJ
14      joy     joy   NOUN
15        .       .  PUNCT
Then I group it by ('lemma', 'pos'):
groups_multiple=df.groupby(['lemma','pos'])
I want to find every lemma that occurs with both pos 'VERB' and pos 'NOUN'. I tried to use .apply() and .filter(), but I could not get them to work.
For example, the lemma 'walk' satisfies the requirement because it appears with both 'VERB' and 'NOUN' in the 'pos' column.
How can I achieve this?
Addition:
In the end I got it working in a clumsy way: by intersecting the set of verb lemmas with the set of noun lemmas.
Here is my code:
lemma_v=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='VERB')
lemma_n=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='NOUN')
lemma_vn=list(lemma_v & lemma_n)
This seems inefficient, but I do not know a better way. Does anybody have an idea how to improve it?
CodePudding user response:
Use groupby with transform to create a boolean mask and select the matching rows:
# custom function: True if the lemma's pos tags include both 'VERB' and 'NOUN'
# (a subset test, so a lemma that also has other tags still matches)
is_verb_and_noun = lambda x: {'VERB', 'NOUN'}.issubset(x)
out = df.loc[df.groupby('lemma')['pos'].transform(is_verb_and_noun), 'lemma']
print(out)
# Output:
1 walk
3 walk
6 walk
10 walk
Name: lemma, dtype: object
Final output:
>>> out.unique().tolist()
['walk']
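If you only need the list of lemmas and not the matching rows, the mask can be skipped. The following is a minimal sketch, not a definitive implementation. It assumes a hand-built copy of the 'lemma' and 'pos' columns from the question so that it runs without spaCy; the names required, pos_sets, lemma_vn and lemma_vn_sets are made up for illustration.

```python
import pandas as pd

# Hand-built copy of the question's 'lemma' and 'pos' columns (assumption:
# same values as the printed DataFrame), so no spaCy model is needed here.
df = pd.DataFrame({
    "lemma": ["-PRON-", "walk", "the", "walk", "and", "still", "walk", "-PRON-",
              "today", ".", "walk", "bring", "-PRON-", "great", "joy", "."],
    "pos": ["PRON", "VERB", "DET", "NOUN", "CCONJ", "ADV", "VERB", "PRON",
            "NOUN", "PUNCT", "VERB", "VERB", "PRON", "ADJ", "NOUN", "PUNCT"],
})

required = {"VERB", "NOUN"}

# Variant 1: collect the set of pos tags per lemma, then keep the lemmas
# whose tag set contains every required tag.
pos_sets = df.groupby("lemma")["pos"].agg(set)
lemma_vn = pos_sets[pos_sets.apply(required.issubset)].index.tolist()

# Variant 2: the set intersection from the question, taken directly from
# the columns instead of iterating over the groups.
verbs = set(df.loc[df["pos"] == "VERB", "lemma"])
nouns = set(df.loc[df["pos"] == "NOUN", "lemma"])
lemma_vn_sets = sorted(verbs & nouns)

print(lemma_vn)
print(lemma_vn_sets)
```

Both variants return one entry per lemma, so no call to unique() is needed afterwards. The second one is essentially the set intersection from the question, just without building the groupby object first.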
