Home > Software engineering >  Populate a column in a pandas df based on a selected column containing a certain substring
Populate a column in a pandas df based on a selected column containing a certain substring

Time:01-06

I have the following dataframe

     id         text
0   100  Donut/cookie, ; penguin
1   101  donut cake! penguinz
2   102  pizza!!!? donut cakepeanut

----------------------------------

data = [[100,'Donut/cookie, ; penguin'],
    [101,'donut cake! penguinz'],
    [102,'pizza!!!? donut cakepeanut']
   ]

df = pd.DataFrame(data, columns = ['id','text'])

I would to add several columns to my df depending on whether specific substrings exist in the text column

So something like this:

    id         text                   donut       cookie      penguin  pizza
0   100  Donut/cookie, ; penguin        yes        yes          yes      no   
1   101  donut cake! penguinz           yes        no           yes      no
2   102  pizza!!!? donut cakepenguin    yes        no           yes      yes  

I just need it yes/no if the substring exists, delimitation and white spaces don't really matter. Also it would be really helpful if it wasn't case sensitive

CodePudding user response:

You can create the list of key word you interested on, then do str.contains

keylist = ['donut', 'cookie', 'penguin', 'pizza']
for x in keylist:
    df[x] = df.text.str.contains(x,case=False).map({True:'yes',False:'no'})
    
df
Out[379]: 
    id                        text donut cookie penguin pizza
0  100     Donut/cookie, ; penguin   yes    yes     yes    no
1  101        donut cake! penguinz   yes     no     yes    no
2  102  pizza!!!? donut cakepeanut   yes     no      no   yes

CodePudding user response:

You could use a regex for that and pandas.get_dummies:

words = ['donut', 'cookie', 'penguin', 'pizza']
regex = '(%s)' % '|'.join(words)

import re
df2 = (pd.get_dummies(df['text'].str.extractall(regex, flags=re.I)[0])
       .groupby(level=0).sum()
       .replace({1:'yes', 0:'no'})
      )

It should be output:

    id                        text cookie donut penguin pizza
0  100     Donut/cookie, ; penguin    yes    no     yes    no
1  101        donut cake! penguinz     no   yes     yes    no
2  102  pizza!!!? donut cakepeanut     no   yes      no   yes

CodePudding user response:

import pandas as pd
data = [[100, 'Donut/cookie, ; penguin'],
        [101, 'donut cake! penguinz'],
        [102, 'pizza!!!? donut cakepeanut']
        ]

df = pd.DataFrame(data, columns=['id', 'text'])


def check(df, key):
    return df['text'].apply(lambda x: "yes" if key in x.lower() else "no")


df['donut'] = check(df, 'donut')
df['cookie'] = check(df, 'cookie')
df['penguin'] = check(df, 'penguin')
df['pizza'] = check(df, 'pizza')
  •  Tags:  
  • Related