Count the frequency of strings found (in any order) in another column and return result in a new col-CodePudding

I'm trying to get the count of words (strings) in a pandas dataframe column that are found in another column in any order.

I've tried the following, which is close, but it doesn't count the occurrences (It only tells me if the words were found in any order).

words='|'.join(df['Cluster Name'].unique())
df['frequency']=df['Keyword'].str.contains(words).astype(int)

Minimum Reproducible Example:

data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
        'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}

# Create DataFrame
df = pd.DataFrame(data)

expected output

              Keyword     Cluster Frequency
0                Nike  Nike Socks         1
1          Nike Socks  Nike Socks         2
2  Nike Stripy Socks   Nike Socks         2
3          Socks Nike  Nike Socks         2
4        Adidas Socks  Nike Socks         1

CodePudding user response：

You can create a custom function that takes a row as an input, and then apply it to the dataframe rowwise using the argument axis=1:

def count_keywords(row):
    freq = 0
    for word in row['Keyword'].split(" "):
        if word in row['Cluster']:
            freq  = 1
    return freq

df['Frequency'] = df.apply(lambda row: count_keywords(row), axis=1)

Output:

>>> df
             Keyword     Cluster  Frequency
0               Nike  Nike Socks          1
1         Nike Socks  Nike Socks          2
2  Nike Stripy Socks  Nike Socks          2
3         Socks Nike  Nike Socks          2
4       Adidas Socks  Nike Socks          1

CodePudding user response：

My answer is similar to @Derek, but it will work properly if the words in Cluster column are separated not only with whitespases

from re import findall
import pandas as pd

def count_corresponding(row):
    keywords = row.Keyword.split(' ')
    count = sum([len(findall(keyword,row.Cluster)) for keyword in keywords])
    return count


data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
        'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}

df = pd.DataFrame(data)
df['Frequency'] = df.apply(count_corresponding, axis=1)

CodePudding user response：

We can explode then count the occurrence for word , then sum the back

x = df.assign(Keyword = df.Keyword.str.split(' ')).explode('Keyword')
df['freq'] = x.apply(lambda y : y['Keyword'] in y['Cluster'],axis=1).groupby(level=0).sum()
df
             Keyword     Cluster  freq
0               Nike  Nike Socks     1
1         Nike Socks  Nike Socks     2
2  Nike Stripy Socks  Nike Socks     2
3         Socks Nike  Nike Socks     2
4       Adidas Socks  Nike Socks     1

CodePudding user response：

You're going to have to use apply anyway, so you can directly use set intersection:

df['Frequency'] = df.apply(lambda x: len(set(x['Keyword'].split()).intersection(x['Cluster'].split())), axis=1)

Output:

             Keyword     Cluster  Frequency
0               Nike  Nike Socks          1
1         Nike Socks  Nike Socks          2
2  Nike Stripy Socks  Nike Socks          2
3         Socks Nike  Nike Socks          2
4       Adidas Socks  Nike Socks          1