I want to keep only the words that are present in my list and delete all other words (pandas DataFrame).
cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
| name | cuisine |
|---|---|
| dominos pizza | breakfast american tea dine in |
| kfc | american chicken play area |
The result should look like this:
| name | cuisine |
|---|---|
| dominos pizza | breakfast american tea |
| kfc | american chicken |
I am using the following code, but it takes a lot of time.
file1_cuisine = file1[["Cuisine"]]
for index, row in file1_cuisine.iterrows():
    words_to_keep = []
    # words_to_match holds the words to keep (cuisine_list above)
    for word in row[0].split(' '):
        if word in words_to_match:
            words_to_keep.append(word + ' ')
    file1_cuisine.loc[index, 'final_input_text'] = ''.join(words_to_keep)
CodePudding user response:
Use a lambda function with split and set intersection, then join the values with a comma:
cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
df['cuisine'] = df['cuisine'].apply(lambda x: ','.join(set(x.split()).intersection(cuisine_list)))
print(df)
            name                 cuisine
0  dominos pizza  tea,breakfast,american
1            kfc        chicken,american
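The words in the output above come back in a different order than in the input because a set is unordered. A minimal sketch of an order-preserving variant, assuming the space-separated result from the question is what is wanted; `keep_listed_words` and `keep` are hypothetical names, not part of the original code:

```python
import pandas as pd

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
keep = set(cuisine_list)  # set lookup avoids scanning the list for every word

def keep_listed_words(text):
    # Hypothetical helper: keep listed words in their original order, drop the rest
    return ' '.join(word for word in text.split() if word in keep)

df = pd.DataFrame({
    'name': ['dominos pizza', 'kfc'],
    'cuisine': ['breakfast american tea dine in', 'american chicken play area'],
})
df['cuisine'] = df['cuisine'].apply(keep_listed_words)
print(df)
```

Unlike the set intersection, this also keeps repeated words if a row contains the same cuisine twice.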
Or use Series.str.findall:
cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
pat = '|'.join(r"\b{}\b".format(x) for x in cuisine_list)
df['cuisine'] = df['cuisine'].str.findall(pat).str.join(',')
print(df)
            name                 cuisine
0  dominos pizza  breakfast,american,tea
1            kfc        american,chicken
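If the list entries could contain regex metacharacters (for example `.` or `+`), the pattern above would misinterpret them. A hedged sketch that escapes each entry with `re.escape` and joins with a space to match the output asked for in the question; the sample DataFrame is rebuilt here only to keep the example self-contained:

```python
import re
import pandas as pd

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
# Escape every entry, then wrap the alternation in one pair of word boundaries
pat = r'\b(?:' + '|'.join(re.escape(x) for x in cuisine_list) + r')\b'

df = pd.DataFrame({
    'name': ['dominos pizza', 'kfc'],
    'cuisine': ['breakfast american tea dine in', 'american chicken play area'],
})
# findall returns the matches in the order they occur in each string
df['cuisine'] = df['cuisine'].str.findall(pat).str.join(' ')
print(df)
```

Note that `\b` only marks a boundary next to a word character, so entries that start or end with punctuation would likely need a different boundary check.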
CodePudding user response:
Use set intersection via the & operator, together with Series.str.split and Series.apply:
In [760]: y = set(cuisine_list)
In [766]: df['cuisine'] = df['cuisine'].str.split().apply(lambda x: list(set(x) & y)).str.join(',')
In [767]: df
Out[767]:
            name                 cuisine
0  dominos pizza  tea,american,breakfast
1            kfc        chicken,american
