I have a df that looks like this:
id  textcol1             textcol2             ...  coln
1   blue bowl            green bowl           ...  xxx
2   purple sheet         green grass          ...  xxx
3   ground black pepper  ground black pepper  ...  xxx
and so on...
I want to get the percentage of common words between textcol1 and textcol2:
id  textcol1             textcol2             ...  coln  intersection
1   blue bowl            green bowl           ...  xxx   50
2   purple sheet         green grass          ...  xxx   0
3   ground black pepper  ground black pepper  ...  xxx   100
After an embarrassingly long time I've come up with the following solution:
df['intersection'] = [(len(set(a) & set(b)) / float(len(set(a) | set(b))) * 100) for a, b in zip(df.textcol1, df.textcol2)]
But the results are not what I would expect, for example passing "ground black pepper" twice yields 93.33333333333330.
I've gone through all the usual cleaning steps - removing whitespace, etc. - but can't figure out what the issue is here.
What am I missing?
CodePudding user response:
Consider writing a generic text comparison function first, something like text_diff below that computes a simple token overlap between two texts (aka sets of tokens):
def text_diff(text1, text2):
    return 100 * len(text1.intersection(text2)) / min(map(len, (text1, text2)))
Then you can get the two columns you want to compare and turn them into sets of tokens (on pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map), e.g.,
df2 = df.filter(like="textcol").applymap(str.split).applymap(set)
Result:
                   textcol1                 textcol2
id
1              {bowl, blue}            {bowl, green}
2           {sheet, purple}           {grass, green}
3   {pepper, black, ground}  {pepper, black, ground}
So you can easily apply the function by doing
>>> df2.apply(lambda row: text_diff(*row), axis=1)
id
1 50.0
2 0.0
3 100.0
dtype: float64
That way you can easily tweak and/or replace your text_diff function. Do some research on text similarity measures, too, and use existing tools if applicable. fuzzywuzzy could be worth a shot, too.
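As for the 93.33 in the question: set(a) on a string builds a set of characters, not words, so that expression compares letters rather than tokens. One way such a number can arise, purely as an assumption about the data (the post does not show the raw cells), is a non-breaking space hiding in one of the two strings. A minimal sketch; jaccard_percent and the \u00a0 example value are made up for illustration:

```python
def jaccard_percent(x, y):
    """Return 100 * |x & y| / |x | y| for two sets (0.0 if both are empty)."""
    union = x | y
    return 100 * len(x & y) / len(union) if union else 0.0


a = "ground black pepper"
# Hypothetical dirty cell: one regular space, one non-breaking space (U+00A0).
b = "ground black\u00a0pepper"

# set(a) is a set of characters, which is what the question's code compares.
char_score = jaccard_percent(set(a), set(b))

# Splitting first compares words; split() with no argument splits on any
# Unicode whitespace, including the non-breaking space.
word_score = jaccard_percent(set(a.split()), set(b.split()))

print(char_score)
print(word_score)
```

Under that assumption the character-level score lands near 93.3 while the word-level score is 100, so splitting into words before building the sets is the part that matters.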
CodePudding user response:
Here's a quick and dirty way. It may need adjusting depending on the text and on how you define an intersection, per Not a robot's point.
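def intersections(x):
    # Concatenate the tokens of both columns into one list
    combined = x['textcol1'].split(' ') + x['textcol2'].split(' ')
    # Count how often each token occurs in the combined list
    total = {i: combined.count(i) for i in combined}
    # Tokens seen more than once are treated as shared
    return sum([v for v in total.values() if v != 1]) / len(combined) * 100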
df['intersections'] = df.apply(intersections, axis=1)
print(df)
              textcol1             textcol2  intersections
0            blue bowl           green bowl           50.0
1         purple sheet          green grass            0.0
2  ground black pepper  ground black pepper          100.0
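One case where the counting above may need adjusting: a word repeated inside a single column also gets a count above 1, so "bowl bowl" against "green cup" would score 50 even though the columns share nothing. A minimal sketch of a variant that only counts tokens present in both columns; shared_token_percent is a hypothetical name, not part of the answer above:

```python
def shared_token_percent(text1, text2):
    """Percentage of all tokens that belong to a word present in both texts."""
    tokens1, tokens2 = text1.split(), text2.split()
    total = len(tokens1) + len(tokens2)
    if total == 0:
        return 0.0  # two empty strings: define the overlap as 0
    # Words that occur in both columns, regardless of repeats within one column
    common = set(tokens1) & set(tokens2)
    shared = sum(token in common for token in tokens1 + tokens2)
    return 100 * shared / total


print(shared_token_percent("blue bowl", "green bowl"))
print(shared_token_percent("bowl bowl", "green cup"))
```

It can be applied row by row the same way, e.g. df.apply(lambda x: shared_token_percent(x['textcol1'], x['textcol2']), axis=1).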
CodePudding user response:
I think the other answers are good, but you want the percentage of common words between a row of textcol1 and textcol2.
To obtain this, we collect all distinct tokens from both columns of the row and count how many of them occur in both textcol1 and textcol2.
The share of common words in the first row must be 0.33, because we compare against the set words = {bowl, blue, green}.
textcol1 and textcol2 have only one word in common: common_words = {bowl}.
As a result we get: #common_words / #all_words = 1 / 3 = 0.33
An example:
from functools import reduce
from operator import add

def fun(text1, text2):
    text1_tokens = text1.split(' ')
    text2_tokens = text2.split(' ')
    text1_set = set(text1_tokens)
    text2_set = set(text2_tokens)
    text_intersect = list(set.intersection(text1_set, text2_set))
    all_tokens = list(set.union(text1_set, text2_set))
    # Each token occurs exactly once in all_tokens, so this sums to len(text_intersect)
    common_token_count = list(map(lambda x: all_tokens.count(x), text_intersect))
    if len(common_token_count) > 0:
        common_token_count = reduce(add, common_token_count)
        # Return a float in both branches instead of a formatted string
        return round(common_token_count / len(all_tokens), 2)
    else:
        return 0.00

df["intersection"] = df.apply(lambda x: fun(x["textcol1"], x["textcol2"]), axis=1)
The output:
0            blue bowl           green bowl  0.33
1         purple sheet          green grass  0.00
2  ground black pepper  ground black pepper  1.00
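The 0.33 here versus the 50 expected in the question comes down to the denominator: dividing by the union of both word sets (the Jaccard index) versus dividing by the size of the smaller set (the overlap coefficient, which is what text_diff in the first answer does). A small side-by-side sketch, with helper names chosen only for this illustration:

```python
def jaccard_percent(s1, s2):
    """100 * |s1 & s2| / |s1 | s2|, or 0.0 when both sets are empty."""
    union = s1 | s2
    return 100 * len(s1 & s2) / len(union) if union else 0.0


def overlap_percent(s1, s2):
    """100 * |s1 & s2| / min(|s1|, |s2|), or 0.0 when either set is empty."""
    smaller = min(len(s1), len(s2))
    return 100 * len(s1 & s2) / smaller if smaller else 0.0


rows = [
    ("blue bowl", "green bowl"),
    ("purple sheet", "green grass"),
    ("ground black pepper", "ground black pepper"),
]

results = []
for text1, text2 in rows:
    words1, words2 = set(text1.split()), set(text2.split())
    results.append((overlap_percent(words1, words2), jaccard_percent(words1, words2)))

for (text1, text2), (overlap, jaccard) in zip(rows, results):
    print(f"{text1!r} vs {text2!r}: overlap={overlap:.2f} jaccard={jaccard:.2f}")
```

For the first row the overlap coefficient gives 50 and Jaccard gives roughly 33.33; the other two rows agree under both measures. Which one is right depends on what "percentage of common words" should mean for your data.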
