Input: df1:
Flow Order
Bananas
White Peach
White Nectarine
df2
Subcategory_Description Segment_Description Flow
POTATO POTATOES P/P PP Potato White 4kg
POTATO POTATOES P/P PP Potato White 1.5kg
POTATO POTATOES P/P PP Potato White 2kg
BANANA BANANAS OTHER (INC P/P) 5pk Kids Banana
STONE FRUIT PEACHES P/P PP Peach White
STONE FRUIT NECTARINES P/P PP Nectarine White
MANGOES KENSINGTON PRIDE Mango Kp
BANANA BANANAS CAVENDISH Banana
STONE FRUIT PEACHES LOOSE Peach White
STONE FRUIT NECTARINES LOOSE Nectarine
STONE FRUIT NECTARINES LOOSE Nectarine White
Scenario: Have to take all matching flows from df2, having df1 as base flow. I have used Fuzzywuzzy library from python for doing so, but the problem here is it's considering partial match as well. For example, in expected Output for Flow - 'White Peach', White Potato is considered here (which shouldn't happen).
Python Code till above mentioned scenario is given below:
import pandas as pd
from fuzzywuzzy import process
# converting to pandas dataframes
dframe1 = pd.DataFrame(df1['Flow Order'])
dframe2 = pd.DataFrame(df2['Flow'])
# empty lists for storing the matches later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
"\nSecond dataframe:\n", dframe2)
# converting dataframe column to list
# of elements
# to do fuzzy matching
list1 = dframe1['Flow Order'].tolist()
list2 = dframe2['Flow'].tolist()
# taking the threshold as 82
threshold = 82
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
mat1.append(process.extract(i, list2, limit=5))
dframe1['matches'] = mat1
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
for k in j:
if k[1] >= threshold:
p.append(k[0])
mat2.append(",".join(p))
p = []
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching:")
print("dframe1:", dframe1)
Above Code Output: dframe1
Flow Order matches
Bananas Banana,5pk Kids Banana
White Peach PP Peach White,Peach White,PP Potato White 4kg,PP Potato White 1.5kg,PP Potato White 2kg
White Nectarine PP Nectarine White,Nectarine White,Nectarine,PP Potato White 2kg,PP Potato White 1kg
So the Idea to be included in above code is to consider 'Subcategory_Description' and 'Segment_Description' from df2 as well to get the perfect match flows alone. These two columns can be used to take out main names for matching strings using fuzzywuzzy. But I'm not sure how to include those.
Expected Output:
Flow Order matches
Bananas Banana,5pk Kids Banana
White Peach PP Peach White,Peach White
White Nectarine PP Nectarine White,Nectarine White,Nectarine
Kindly help me out. Thanks in Advance!
CodePudding user response:
You can try using the scorer token_sort_ratio (check out the code), it will attempt to account for similar strings out of order. Combining with your threshold, it should work:
from fuzzywuzzy import fuzz
process.extract('White Peach', ['PP Peach White','Peach White','PP Potato White 4kg',
'PP Potato White 1.5kg','PP Potato White 2kg'],
scorer=fuzz.token_sort_ratio)
which will give
[('Peach White', 100),
('PP Peach White', 88),
('PP Potato White 4kg', 53),
('PP Potato White 2kg', 53),
('PP Potato White 1.5kg', 50)]
CodePudding user response:
You can use thefuzz (the new name of former fuzzywuzzy) with process.extractOne. To improve matching, use the scorer=fuzz.token_sort_ratio and set a high enough cutoff (here: score_cutoff=80). The ideal cutoff should be determined empirically.
I am using df1 and df2 as dataframe names.
from thefuzz import process, fuzz
match = df2['Flow'].apply(lambda x: out[0] if
(out:=process.extractOne(x, df1['Flow Order'],
scorer=fuzz.token_sort_ratio,
score_cutoff=80))
else None)
df2.groupby(match).agg(matches=('Flow', ','.join))
output:
matches
Flow
Bananas Banana
White Nectarine PP Nectarine White,Nectarine White
White Peach PP Peach White,Peach White
To help determine an ideal cutoff, you could run:
df2.join(df2['Flow'].apply(lambda x: process.extractOne(x, df1['Flow Order'],
scorer=fuzz.token_sort_ratio)
)
.rename('match')
)
output:
Subcategory_Description Segment_Description Flow match
0 POTATO POTATOES P/P PP Potato White 4kg (White Peach, 53, 1)
1 POTATO POTATOES P/P PP Potato White 1.5kg (White Peach, 50, 1)
2 POTATO POTATOES P/P PP Potato White 2kg (White Peach, 53, 1)
3 BANANA BANANAS OTHER (INC P/P) 5pk Kids Banana (Bananas, 64, 0)
4 STONE FRUIT PEACHES P/P PP Peach White (White Peach, 88, 1)
5 STONE FRUIT NECTARINES P/P PP Nectarine White (White Nectarine, 91, 2)
6 MANGOES KENSINGTON PRIDE Mango Kp (Bananas, 27, 0)
7 BANANA BANANAS CAVENDISH Banana (Bananas, 92, 0)
8 STONE FRUIT PEACHES LOOSE Peach White (White Peach, 100, 1)
9 STONE FRUIT NECTARINES LOOSE Nectarine (White Nectarine, 75, 2)
10 STONE FRUIT NECTARINES LOOSE Nectarine White (White Nectarine, 100, 2)
