I have the following dataframe
id text
0 100 Donut/cookie, ; penguin
1 101 donut cake! penguinz
2 102 pizza!!!? donut cakepeanut
----------------------------------
data = [[100,'Donut/cookie, ; penguin'],
[101,'donut cake! penguinz'],
[102,'pizza!!!? donut cakepeanut']
]
df = pd.DataFrame(data, columns = ['id','text'])
I would to add several columns to my df depending on whether specific substrings exist in the text column
So something like this:
id text donut cookie penguin pizza
0 100 Donut/cookie, ; penguin yes yes yes no
1 101 donut cake! penguinz yes no yes no
2 102 pizza!!!? donut cakepenguin yes no yes yes
I just need it yes/no if the substring exists, delimitation and white spaces don't really matter. Also it would be really helpful if it wasn't case sensitive
CodePudding user response:
You can create the list of key word you interested on, then do str.contains
keylist = ['donut', 'cookie', 'penguin', 'pizza']
for x in keylist:
df[x] = df.text.str.contains(x,case=False).map({True:'yes',False:'no'})
df
Out[379]:
id text donut cookie penguin pizza
0 100 Donut/cookie, ; penguin yes yes yes no
1 101 donut cake! penguinz yes no yes no
2 102 pizza!!!? donut cakepeanut yes no no yes
CodePudding user response:
You could use a regex for that and pandas.get_dummies:
words = ['donut', 'cookie', 'penguin', 'pizza']
regex = '(%s)' % '|'.join(words)
import re
df2 = (pd.get_dummies(df['text'].str.extractall(regex, flags=re.I)[0])
.groupby(level=0).sum()
.replace({1:'yes', 0:'no'})
)
It should be output:
id text cookie donut penguin pizza
0 100 Donut/cookie, ; penguin yes no yes no
1 101 donut cake! penguinz no yes yes no
2 102 pizza!!!? donut cakepeanut no yes no yes
CodePudding user response:
import pandas as pd
data = [[100, 'Donut/cookie, ; penguin'],
[101, 'donut cake! penguinz'],
[102, 'pizza!!!? donut cakepeanut']
]
df = pd.DataFrame(data, columns=['id', 'text'])
def check(df, key):
return df['text'].apply(lambda x: "yes" if key in x.lower() else "no")
df['donut'] = check(df, 'donut')
df['cookie'] = check(df, 'cookie')
df['penguin'] = check(df, 'penguin')
df['pizza'] = check(df, 'pizza')
