Assigning True/False if a token is present in a data-frame

My current data-frame is:

     |articleID | keywords                                               | 
     |:-------- |:------------------------------------------------------:| 
0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      |     
1    |58b6393b  | ['Crossword Puzzles']                                  |          
2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']|            
3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         |

I want a data-frame like the following, with an added column, trumpMention, that is True when the token 'Trump, Donald J' appears in the row's keywords and False otherwise:

     |articleID | keywords                                               | trumpMention |
     |:-------- |:------------------------------------------------------:| ------------:|
0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      | False        |      
1    |58b6393b  | ['Crossword Puzzles']                                  | False        |          
2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']| True         |           
3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         | True         |

I have tried several approaches using DataFrame functions but cannot get the result I want. Some of the things I've tried:

df['trumpMention'] = np.where(any(df['keywords']) == 'Trump, Donald J', True, False)

df['trumpMention'] = df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x)

lst = ['Trump, Donald J']  
df['trumpMention'] = df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))

Raw input:

df = pd.DataFrame({'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
                   'keywords': [['Second Avenue (Manhattan, NY)'],
                                ['Crossword Puzzles'],
                                ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                ['Trump, Donald J', 'Speeches and Statements']],
                   'trumpMention': [False, False, True, True]})

CodePudding user response：

Try:

df["trumpMention"] = df["keywords"].apply(lambda x: "Trump, Donald J" in x)

CodePudding user response：

How about applying a function that checks set membership?

df['trumpMention'] = df['keywords'].apply(lambda x: 'Trump, Donald J' in set(x))

Output:

  articleID                                           keywords  trumpMention
0  58b61d1d                    [Second Avenue (Manhattan, NY)]         False
1  58b6393b                                [Crossword Puzzles]         False
2  58b6556e  [Workplace Hazards and Violations, Trump, Dona...          True
3  58b657fa         [Trump, Donald J, Speeches and Statements]          True
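If more than one token should count as a mention, the same set idea extends naturally. The following is only a sketch, assuming the keywords column holds Python lists as in the question's raw input; the name tokens and the choice of a second token are illustrative, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
    'keywords': [['Second Avenue (Manhattan, NY)'],
                 ['Crossword Puzzles'],
                 ['Workplace Hazards and Violations', 'Trump, Donald J'],
                 ['Trump, Donald J', 'Speeches and Statements']],
})

# Illustrative set of tokens; the second one is picked only for the demo
tokens = {'Trump, Donald J', 'Crossword Puzzles'}

# A row matches when its keyword list shares at least one item with tokens
any_match = df['keywords'].apply(lambda x: not tokens.isdisjoint(x))

print(any_match.tolist())
```

With a single token this reduces to the plain membership test shown above.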

As to your attempts:

np.where(any(df['keywords']) == 'Trump, Donald J', True, False)

wouldn't work because any(df['keywords']) collapses the whole column into a single True (every row holds a non-empty list). True == 'Trump, Donald J' is False, so np.where always returns array(False) and every row is assigned False.

df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x)

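doesn't work because the for token in x clause sits outside the call to any(). Python parses the whole argument as a generator expression that yields lambdas, so apply never receives a callable. Even with that fixed, any(token == 'Trump, Donald J') would raise a TypeError, since any() is given a single boolean rather than an iterable.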

df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))

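doesn't work because token in lst is already a single boolean value, so

any(token in lst)

raises TypeError: 'bool' object is not iterable. Even without any(), the comprehension would return a list such as [True] or [] for each row, not a single boolean.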
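For comparison, the second and third attempts can be repaired by moving the for clause inside any(). This is a minimal sketch, assuming the keywords column holds Python lists as in the question's raw input; the names target, fixed_2 and fixed_3 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
    'keywords': [['Second Avenue (Manhattan, NY)'],
                 ['Crossword Puzzles'],
                 ['Workplace Hazards and Violations', 'Trump, Donald J'],
                 ['Trump, Donald J', 'Speeches and Statements']],
})

target = 'Trump, Donald J'  # illustrative name, not from the question
lst = [target]

# Attempt 2, repaired: the generator expression goes inside any()
fixed_2 = df['keywords'].apply(lambda x: any(token == target for token in x))

# Attempt 3, repaired: any() now iterates over one boolean per token
fixed_3 = df['keywords'].apply(lambda x: any(token in lst for token in x))

print(fixed_2.tolist())
print(fixed_3.tolist())
```

Both give the same result as the simpler membership test, which is why the shorter form is usually preferable.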

CodePudding user response：

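Use the string accessor instead of apply; it can be faster on large frames. Note that this searches the string form of each list, so it also matches any longer keyword that contains "Trump, Donald J".

df["trumpMention"] = df["keywords"].astype(str).str.contains("Trump, Donald J", regex=False)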
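One thing to watch with the substring search above is that it cannot tell a keyword from a longer keyword that starts with it. The sketch below contrasts it with an exact match built on Series.explode; the frame here is made up for illustration, and 'Trump, Donald J Jr' is a hypothetical keyword that does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'articleID': ['a1', 'a2', 'a3'],           # made-up IDs
    'keywords': [['Crossword Puzzles'],
                 ['Trump, Donald J'],
                 ['Trump, Donald J Jr']],      # hypothetical longer keyword
})
target = 'Trump, Donald J'

# Substring search on the stringified list: also flags the longer keyword
by_substring = df['keywords'].astype(str).str.contains(target, regex=False)

# Exact match: one row per keyword, compare, then collapse per original row
by_exact = df['keywords'].explode().eq(target).groupby(level=0).any()

print(by_substring.tolist())
print(by_exact.tolist())
```

Whether the exact-match version is faster than apply depends on the data, so it is worth timing both on the real frame.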

CodePudding user response：

Here is another way: build the list of True/False values first, then add it to the data-frame as a new column.

import pandas as pd

def mentioned_Trump(s, lst):
    # True if the keyword s is one of the items in lst
    return s in lst

# Build the frame from [ID, keywords] rows
rows = [[1, ['Second Avenue (Manhattan, NY)']],
        [2, ['Crossword Puzzles']],
        [3, ['Workplace Hazards and Violations', 'Trump, Donald J']],
        [4, ['Trump, Donald J', 'Speeches and Statements']]]
df = pd.DataFrame(rows, columns=['ID', 'keywords'])

# Compute the flags as a plain list, then attach it as a new column
flags = [mentioned_Trump('Trump, Donald J', x) for x in df['keywords']]
df['trumpMention'] = flags
print(df)