Need to match string considering two or more columns instead of one column-CodePudding

Input: df1:

Flow Order
Bananas
White Peach 
White Nectarine

df2

Subcategory_Description     Segment_Description     Flow
POTATO                      POTATOES P/P            PP Potato White 4kg
POTATO                      POTATOES P/P            PP Potato White 1.5kg
POTATO                      POTATOES P/P            PP Potato White 2kg
BANANA                      BANANAS OTHER (INC P/P) 5pk Kids Banana
STONE FRUIT                 PEACHES P/P             PP Peach White
STONE FRUIT                 NECTARINES P/P          PP Nectarine White
MANGOES                     KENSINGTON PRIDE        Mango Kp
BANANA                      BANANAS CAVENDISH       Banana
STONE FRUIT                 PEACHES LOOSE           Peach White
STONE FRUIT                 NECTARINES LOOSE        Nectarine
STONE FRUIT                 NECTARINES LOOSE        Nectarine White

Scenario: Have to take all matching flows from df2, having df1 as base flow. I have used Fuzzywuzzy library from python for doing so, but the problem here is it's considering partial match as well. For example, in expected Output for Flow - 'White Peach', White Potato is considered here (which shouldn't happen).

Python Code till above mentioned scenario is given below:

import pandas as pd
from fuzzywuzzy import process

# converting to pandas dataframes
dframe1 = pd.DataFrame(df1['Flow Order'])
dframe2 = pd.DataFrame(df2['Flow'])

# empty lists for storing the matches later
mat1 = []
mat2 = []
p = []

# printing the pandas dataframes
print("First dataframe:\n", dframe1,
    "\nSecond dataframe:\n", dframe2)

# converting dataframe column to list
# of elements
# to do fuzzy matching
list1 = dframe1['Flow Order'].tolist()
list2 = dframe2['Flow'].tolist()

# taking the threshold as 82
threshold = 82

# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=5))
dframe1['matches'] = mat1

# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []


# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching:")
print("dframe1:", dframe1)

Above Code Output: dframe1

Flow Order          matches
Bananas             Banana,5pk Kids Banana
White Peach         PP Peach White,Peach White,PP Potato White 4kg,PP Potato White 1.5kg,PP Potato White 2kg
White Nectarine     PP Nectarine White,Nectarine White,Nectarine,PP Potato White 2kg,PP Potato White 1kg

So the Idea to be included in above code is to consider 'Subcategory_Description' and 'Segment_Description' from df2 as well to get the perfect match flows alone. These two columns can be used to take out main names for matching strings using fuzzywuzzy. But I'm not sure how to include those.

Expected Output:

Flow Order          matches
Bananas             Banana,5pk Kids Banana
White Peach         PP Peach White,Peach White
White Nectarine     PP Nectarine White,Nectarine White,Nectarine

Kindly help me out. Thanks in Advance!

CodePudding user response：

You can try using the scorer token_sort_ratio (check out the code), it will attempt to account for similar strings out of order. Combining with your threshold, it should work:

from fuzzywuzzy import fuzz
process.extract('White Peach', ['PP Peach White','Peach White','PP Potato White 4kg',
                                'PP Potato White 1.5kg','PP Potato White 2kg'], 
                scorer=fuzz.token_sort_ratio)

which will give

[('Peach White', 100),
 ('PP Peach White', 88),
 ('PP Potato White 4kg', 53),
 ('PP Potato White 2kg', 53),
 ('PP Potato White 1.5kg', 50)]

CodePudding user response：

You can use thefuzz (the new name of former fuzzywuzzy) with process.extractOne. To improve matching, use the scorer=fuzz.token_sort_ratio and set a high enough cutoff (here: score_cutoff=80). The ideal cutoff should be determined empirically.

I am using df1 and df2 as dataframe names.

from thefuzz import process, fuzz

match = df2['Flow'].apply(lambda x: out[0] if
                                   (out:=process.extractOne(x, df1['Flow Order'],
                                                            scorer=fuzz.token_sort_ratio,
                                                            score_cutoff=80))
                                    else None)

df2.groupby(match).agg(matches=('Flow', ','.join))

output:

                                            matches
Flow                                               
Bananas                                      Banana
White Nectarine  PP Nectarine White,Nectarine White
White Peach              PP Peach White,Peach White

To help determine an ideal cutoff, you could run:

df2.join(df2['Flow'].apply(lambda x: process.extractOne(x, df1['Flow Order'], 
                                                        scorer=fuzz.token_sort_ratio)
                           )
                    .rename('match')
         )

output:

   Subcategory_Description      Segment_Description                   Flow                      match
0                   POTATO             POTATOES P/P    PP Potato White 4kg       (White Peach, 53, 1)
1                   POTATO             POTATOES P/P  PP Potato White 1.5kg       (White Peach, 50, 1)
2                   POTATO             POTATOES P/P    PP Potato White 2kg       (White Peach, 53, 1)
3                   BANANA  BANANAS OTHER (INC P/P)        5pk Kids Banana           (Bananas, 64, 0)
4              STONE FRUIT              PEACHES P/P         PP Peach White       (White Peach, 88, 1)
5              STONE FRUIT           NECTARINES P/P     PP Nectarine White   (White Nectarine, 91, 2)
6                  MANGOES         KENSINGTON PRIDE               Mango Kp           (Bananas, 27, 0)
7                   BANANA        BANANAS CAVENDISH                 Banana           (Bananas, 92, 0)
8              STONE FRUIT            PEACHES LOOSE            Peach White      (White Peach, 100, 1)
9              STONE FRUIT         NECTARINES LOOSE              Nectarine   (White Nectarine, 75, 2)
10             STONE FRUIT         NECTARINES LOOSE        Nectarine White  (White Nectarine, 100, 2)