I'm trying to get the count of words (strings) in a pandas dataframe column that are found in another column in any order.
I've tried the following, which is close, but it doesn't count the occurrences (It only tells me if the words were found in any order).
words='|'.join(df['Cluster Name'].unique())
df['frequency']=df['Keyword'].str.contains(words).astype(int)
Minimum Reproducible Example:
data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
# Create DataFrame
df = pd.DataFrame(data)
expected output
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
CodePudding user response:
You can create a custom function that takes a row as an input, and then apply it to the dataframe rowwise using the argument axis=1:
def count_keywords(row):
freq = 0
for word in row['Keyword'].split(" "):
if word in row['Cluster']:
freq = 1
return freq
df['Frequency'] = df.apply(lambda row: count_keywords(row), axis=1)
Output:
>>> df
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
CodePudding user response:
My answer is similar to @Derek, but it will work properly if the words in Cluster column are separated not only with whitespases
from re import findall
import pandas as pd
def count_corresponding(row):
keywords = row.Keyword.split(' ')
count = sum([len(findall(keyword,row.Cluster)) for keyword in keywords])
return count
data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
df = pd.DataFrame(data)
df['Frequency'] = df.apply(count_corresponding, axis=1)
CodePudding user response:
We can explode then count the occurrence for word , then sum the back
x = df.assign(Keyword = df.Keyword.str.split(' ')).explode('Keyword')
df['freq'] = x.apply(lambda y : y['Keyword'] in y['Cluster'],axis=1).groupby(level=0).sum()
df
Keyword Cluster freq
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
CodePudding user response:
You're going to have to use apply anyway, so you can directly use set intersection:
df['Frequency'] = df.apply(lambda x: len(set(x['Keyword'].split()).intersection(x['Cluster'].split())), axis=1)
Output:
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
