Populate a column in a pandas df based on a selected column containing a certain substring-CodePudding

I have the following dataframe

     id         text
0   100  Donut/cookie, ; penguin
1   101  donut cake! penguinz
2   102  pizza!!!? donut cakepeanut

----------------------------------

data = [[100,'Donut/cookie, ; penguin'],
    [101,'donut cake! penguinz'],
    [102,'pizza!!!? donut cakepeanut']
   ]

df = pd.DataFrame(data, columns = ['id','text'])

I would to add several columns to my df depending on whether specific substrings exist in the text column

So something like this:

    id         text                   donut       cookie      penguin  pizza
0   100  Donut/cookie, ; penguin        yes        yes          yes      no   
1   101  donut cake! penguinz           yes        no           yes      no
2   102  pizza!!!? donut cakepenguin    yes        no           yes      yes

I just need it yes/no if the substring exists, delimitation and white spaces don't really matter. Also it would be really helpful if it wasn't case sensitive

CodePudding user response：

You can create the list of key word you interested on, then do str.contains

keylist = ['donut', 'cookie', 'penguin', 'pizza']
for x in keylist:
    df[x] = df.text.str.contains(x,case=False).map({True:'yes',False:'no'})
    
df
Out[379]: 
    id                        text donut cookie penguin pizza
0  100     Donut/cookie, ; penguin   yes    yes     yes    no
1  101        donut cake! penguinz   yes     no     yes    no
2  102  pizza!!!? donut cakepeanut   yes     no      no   yes

CodePudding user response：

You could use a regex for that and pandas.get_dummies:

words = ['donut', 'cookie', 'penguin', 'pizza']
regex = '(%s)' % '|'.join(words)

import re
df2 = (pd.get_dummies(df['text'].str.extractall(regex, flags=re.I)[0])
       .groupby(level=0).sum()
       .replace({1:'yes', 0:'no'})
      )

It should be output:

    id                        text cookie donut penguin pizza
0  100     Donut/cookie, ; penguin    yes    no     yes    no
1  101        donut cake! penguinz     no   yes     yes    no
2  102  pizza!!!? donut cakepeanut     no   yes      no   yes

CodePudding user response：

import pandas as pd
data = [[100, 'Donut/cookie, ; penguin'],
        [101, 'donut cake! penguinz'],
        [102, 'pizza!!!? donut cakepeanut']
        ]

df = pd.DataFrame(data, columns=['id', 'text'])


def check(df, key):
    return df['text'].apply(lambda x: "yes" if key in x.lower() else "no")


df['donut'] = check(df, 'donut')
df['cookie'] = check(df, 'cookie')
df['penguin'] = check(df, 'penguin')
df['pizza'] = check(df, 'pizza')