How to get intersection of two text columns in pandas df

I have a df that looks like this:

id  textcol1             textcol2             ... coln
1   blue bowl            green bowl           ... xxx
2   purple sheet         green grass          ... xxx
3   ground black pepper  ground black pepper  ... xxx

and so on...

I want to get the percentage of common words between textcol1 and textcol2:

id  textcol1             textcol2             ... coln intersection
1   blue bowl            green bowl           ... xxx  50
2   purple sheet         green grass          ... xxx  0
3   ground black pepper  ground black pepper  ... xxx  100

After an embarrassingly long time I've come up with the following solution:

df['intersection'] = [(len(set(a) & set(b)) / float(len(set(a) | set(b))) * 100) for a, b in zip(df.textcol1, df.textcol2)]

But the results are not what I would expect. For example, passing "ground black pepper" twice yields 93.33333333333330.

I've gone through all the usual cleaning steps - removing whitespace, etc. - but can't figure out what the issue is here.

What am I missing?

CodePudding user response：

Consider writing a generic text comparison function first, something like text_diff below that computes a simple token overlap between two texts (aka sets of tokens):

def text_diff(text1, text2):
    return 100 * len(text1.intersection(text2)) / min(map(len, (text1, text2)))

Then you can get the two columns you want to compare and turn them into sets of tokens, e.g.,

df2 = df.filter(like="textcol").applymap(str.split).applymap(set)

Result:

                   textcol1                 textcol2
id                                                  
1              {bowl, blue}            {bowl, green}
2           {sheet, purple}           {grass, green}
3   {pepper, black, ground}  {pepper, black, ground}
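
This tokenizing step is probably what the list comprehension in the question is missing: set() applied to a string collects its characters, not its words. A minimal sketch of the difference follows. The stray tab in b is purely an assumption, chosen because one extra character on one side would be consistent with the 93.33 reported in the question.

```python
a = "ground black pepper"
b = "ground black pepper\t"  # hypothetical: a stray tab that cleaning missed

# set() on a string iterates over characters, not words.
char_overlap = len(set(a) & set(b)) / len(set(a) | set(b)) * 100

# Splitting first compares whole words; str.split() with no argument
# also discards leading/trailing whitespace such as the stray tab.
words_a, words_b = set(a.split()), set(b.split())
word_overlap = len(words_a & words_b) / len(words_a | words_b) * 100

print(round(char_overlap, 2))  # 14 shared characters out of 15 distinct ones
print(word_overlap)
```

With 14 of 15 distinct characters shared, the character-level ratio lands near 93.33, while the word-level ratio is 100.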

So you can easily apply the function by doing

>>> df2.apply(lambda row: text_diff(*row), axis=1)
id
1     50.0
2      0.0
3    100.0
dtype: float64

That way you can easily tweak and/or replace your text_diff function. Do some research on text similarity measures, too, and use existing tools if applicable. fuzzywuzzy could be worth a shot, too.
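
In case it helps to run the whole thing end to end, here is a self-contained sketch of the same pipeline. It sidesteps applymap, which pandas 2.1 and later deprecate in favor of DataFrame.map, and it adds a guard for empty texts that the original text_diff does not have. The sample frame is rebuilt from the question, and the name overlap is just illustrative.

```python
import pandas as pd


def text_diff(tokens1, tokens2):
    # Shared tokens relative to the smaller token set, as a percentage.
    smaller = min(len(tokens1), len(tokens2))
    if smaller == 0:
        return 0.0  # assumption: treat an empty text as "no overlap"
    return 100 * len(tokens1 & tokens2) / smaller


df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "textcol1": ["blue bowl", "purple sheet", "ground black pepper"],
        "textcol2": ["green bowl", "green grass", "ground black pepper"],
    }
).set_index("id")

# Split each column on whitespace, then turn every token list into a set.
tokens1 = df["textcol1"].str.split().map(set)
tokens2 = df["textcol2"].str.split().map(set)

overlap = pd.Series(
    [text_diff(a, b) for a, b in zip(tokens1, tokens2)], index=df.index
)
print(overlap)
```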

CodePudding user response：

Here's a quick and dirty way, but it might need to be adjusted based on the text and on how you define an intersection, to Not a robot's point.



def intersections(x):
    combined = x['textcol1'].split(' ') + x['textcol2'].split(' ')
    total = {i:combined.count(i) for i in combined}
    return sum([v for v in total.values() if v != 1]) / len(combined) * 100

df['intersections'] = df.apply(intersections, axis=1)
print(df)

              textcol1             textcol2  intersections
0            blue bowl           green bowl           50.0
1         purple sheet          green grass            0.0
2  ground black pepper  ground black pepper          100.0
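
One adjustment that may matter: because the count runs over the combined list, a word repeated inside a single column also counts as an "intersection". The sketch below shows this on a made-up pair of texts (not from the question) and one possible fix, which is to de-duplicate each text before combining. The function names here are hypothetical.

```python
def intersections_raw(text1, text2):
    # Same counting logic as above, on plain strings instead of a row.
    combined = text1.split(' ') + text2.split(' ')
    total = {word: combined.count(word) for word in combined}
    return sum(v for v in total.values() if v != 1) / len(combined) * 100


def intersections_dedup(text1, text2):
    # De-duplicate each text first so only cross-column repeats are counted.
    combined = list(set(text1.split())) + list(set(text2.split()))
    total = {word: combined.count(word) for word in combined}
    return sum(v for v in total.values() if v != 1) / len(combined) * 100


# Hypothetical input: "blue" repeats within one text, nothing is shared.
raw_score = intersections_raw("blue blue bowl", "green cup")
dedup_score = intersections_dedup("blue blue bowl", "green cup")
print(raw_score, dedup_score)
```

Here the raw version reports a nonzero score even though the two texts share no word, while the de-duplicated version reports 0. On the three rows from the question both versions agree.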

CodePudding user response：

I think the other answers are good, but as I read it you want the percentage of common words between textcol1 and textcol2 within each row.

To obtain this, collect all distinct tokens from both columns of the row and count how many of them occur in both textcol1 and textcol2.

By that definition the value for the first row must be 0.33, because we compare against the set of all words, words = {bowl, blue, green}. textcol1 and textcol2 have only one word in common, common_words = {bowl}.

As a result we get: #common_words / #all_words = 1 / 3 = 0.33

An example:

def fun(text1, text2):
    # Split each text on whitespace and keep only the distinct tokens.
    text1_set = set(text1.split())
    text2_set = set(text2.split())

    common_tokens = text1_set & text2_set  # words present in both texts
    all_tokens = text1_set | text2_set     # every distinct word in either text

    # Guard against two empty texts, which would divide by zero.
    if not all_tokens:
        return 0.0

    # Shared words / all distinct words, always returned as a float.
    return round(len(common_tokens) / len(all_tokens), 2)


df["intersection"] = df.apply(lambda x: fun(x["textcol1"], x["textcol2"]), axis=1)

The output:

              textcol1             textcol2  intersection
0            blue bowl           green bowl          0.33
1         purple sheet          green grass          0.00
2  ground black pepper  ground black pepper          1.00
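
For what it's worth, the answers in this thread disagree on the first row (50 versus 0.33) only because they divide the same shared-word count by different things. A small sketch of the three denominators on that row follows. The labels (overlap coefficient, Dice, Jaccard) are the usual names for these ratios and are my reading of the answers, not terms the answerers used.

```python
a = set("blue bowl".split())
b = set("green bowl".split())

shared = len(a & b)  # words present in both texts: {"bowl"}

# First answer: shared words over the size of the smaller set.
overlap_coefficient = shared / min(len(a), len(b))

# Second answer (when no word repeats within a text): each shared word is
# counted once per side, over the total number of tokens.
dice = 2 * shared / (len(a) + len(b))

# Third answer: shared words over all distinct words in either text.
jaccard = shared / len(a | b)

print(overlap_coefficient, dice, round(jaccard, 2))
```

Which one is "right" depends on what the percentage is supposed to mean, so it is probably worth deciding on the denominator before picking an implementation.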