Search for keyword matches in one Dataframe contained in another Dataframe-CodePudding

I have a dataframe A with values that were entered by humans, so they have a degree of variance even though they refer to the same keyword: foo001, foo1, 0foo1 all mean foo1.

I have this other dataframe B with keywords as an index and properties associated to them in different columns.

My goal is to go through every row of dataframe A, and if a match for a keyword is found, it adds to this row the properties found in the dataframe B corresponding to the said keyword.

Dataframe A :

index	col1	col2
1	bar	baz
2	foo01	boo

Dataframe B :

index	col1	col2
foo	fooprop1	fooprop2
bim	bimprop1	bimprop2

A match is found in Dataframe A, so it becomes :

Dataframe A :

index	col1	col2	prop1	prop2
1	bar	baz
2	foo01	boo	fooprop1	fooprop2

I can figure out how to find matches for those keywords using regex. But i can't find a nifty way to do the other part, that is without using loops and all.

Thanks in advance if you have suggestions !

CodePudding user response：

This is my approach but you may change the part for similarity checking between strings of col1. I use fuzzywuzzy for similarity checking between strings and filter those who have score higher than 50:

import pandas as pd
import numpy as np
from fuzzywuzzy import process

df1 = pd.DataFrame({'col1': ['bar', 'foo01'],
                    'col2': ['baz', 'boo']})
df2 = pd.DataFrame({'col1': ['fooprop1', 'bimprop1'],
                    'col2': ['fooprop2', 'bimprop2']})
df2.columns = ['close_word', 'prop2']

# create a choice list
choices = df2['close_word'].values.tolist()


def find_closest(word):
    close_word = process.extractOne(word, choices)
    if close_word[1] > 50:
        return close_word[0]
    return np.nan


# apply fuzzywuzzy to each row using lambda expression
df1['close_word'] = df1['col1'].apply(lambda x: find_closest(x))

# merge
result = pd.merge(df1, df2, on="close_word", how="left")
print(result)

Output:

     col1 col2     close_word     prop2
0    bar  baz       NaN       NaN
1  foo01  boo   fooprop1  fooprop2