I have a list
a = ['apples', 'bananas', 'oranges', 'grapes']
And a data frame with a column of phrases
| b | c |
|---|---|
| there are 5 apples | there are 5 |
| here are 3 pears | here are 3 pears |
| i want 2 grapes | i want 2 |
I would like to have another column in my data frame which removes the words from list a (ex in the data frame column c). They need to be an exact match.
After searching up some regex I came up with this but it doesn't seem to work properly.
regex = re.compile('|'.join(re.escape(x) for x in a), re.IGNORECASE)
removed = []
for i in df['b']:
words = re.findall(regex, str(i))
removed.append(words)
df['c']=removed
df
Also got this error: unbalanced parenthesis at position
CodePudding user response:
You do not actually need any regex, since these are exact matches.
You can do something like this:
import pandas as pd
a = ['apples', 'bananas', 'oranges', 'grapes']
df = pd.DataFrame({'b': ['there are 5 apples', 'here are 5 pears', 'I want 2 grapes']})
# for each row in `b` remove all words that are in `a`
df['c'] = df['b'].apply(lambda x: ' '.join([word for word in x.split() if word not in a]))
b c
0 there are 5 apples there are 5
1 here are 5 pears here are 5 pears
2 I want 2 grapes I want 2
CodePudding user response:
Use str.replace:
I slightly modified your regex:
regex = re.compile(fr"\s*({'|'.join(re.escape(x) for x in a)})", re.IGNORECASE)
df['c'] = df['b'].str.replace(regex, '')
print(df)
# Output
b c
0 there are 5 apples there are 5
1 here are 3 pears here are 3 pears
2 i want 2 grapes i want 2
CodePudding user response:
You could use reduce:
import re
from functools import reduce
a = ['apples', 'bananas', 'oranges', 'grapes']
sentences = ["there are 5 apples", "here are 3 pears", "i want 2 grapes"]
print([reduce(lambda x, p: re.sub(p, "", x), a, sentence).strip() for sentence in sentences])
OUTPUT
['there are 5', 'here are 3 pears', 'i want 2']
CodePudding user response:
You can convert the column b into a list of words, use explode and groupby to only keep the ones not in a, and join everything back
Code could be:
# split column b into lists and explode it
words = df['b'].str.split().explode()
# remove words contained in a list
words = words[~ words.isin(a)]
# join everything back
df['c'] = words.groupby(level=0).agg(list).transform(' '.join)
