Match words in a dataframe column from a list-CodePudding

I have a list
a = ['apples', 'bananas', 'oranges', 'grapes']

And a data frame with a column of phrases

b	c
there are 5 apples	there are 5
here are 3 pears	here are 3 pears
i want 2 grapes	i want 2

I would like to have another column in my data frame which removes the words from list a (ex in the data frame column c). They need to be an exact match.

After searching up some regex I came up with this but it doesn't seem to work properly.

regex = re.compile('|'.join(re.escape(x) for x in a), re.IGNORECASE)

removed = []
for i in df['b']:
    words = re.findall(regex, str(i))
    removed.append(words)

df['c']=removed
df

Also got this error: unbalanced parenthesis at position

CodePudding user response：

You do not actually need any regex, since these are exact matches.

You can do something like this:

import pandas as pd
a = ['apples', 'bananas', 'oranges', 'grapes']

df = pd.DataFrame({'b': ['there are 5 apples', 'here are 5 pears', 'I want 2 grapes']})
# for each row in `b` remove all words that are in `a`
df['c'] = df['b'].apply(lambda x: ' '.join([word for word in x.split() if word not in a]))


    b   c
0   there are 5 apples  there are 5
1   here are 5 pears    here are 5 pears
2   I want 2 grapes I want 2

CodePudding user response：

Use str.replace:

I slightly modified your regex:

regex = re.compile(fr"\s*({'|'.join(re.escape(x) for x in a)})", re.IGNORECASE)

df['c'] = df['b'].str.replace(regex, '')
print(df)

# Output
                    b                 c
0  there are 5 apples       there are 5
1    here are 3 pears  here are 3 pears
2     i want 2 grapes          i want 2

CodePudding user response：

You could use reduce:

import re
from functools import reduce

a = ['apples', 'bananas', 'oranges', 'grapes']

sentences = ["there are 5 apples", "here are 3 pears", "i want 2 grapes"]

print([reduce(lambda x, p: re.sub(p, "", x), a, sentence).strip() for sentence in sentences])

OUTPUT

['there are 5', 'here are 3 pears', 'i want 2']

CodePudding user response：

You can convert the column b into a list of words, use explode and groupby to only keep the ones not in a, and join everything back

Code could be:

# split column b into lists and explode it
words = df['b'].str.split().explode()
# remove words contained in a list
words = words[~ words.isin(a)]

# join everything back
df['c'] = words.groupby(level=0).agg(list).transform(' '.join)