I have a dataframe A with values that were entered by humans, so they have a degree of variance even though they refer to the same keyword: foo001, foo1, 0foo1 all mean foo1.
I have this other dataframe B with keywords as an index and properties associated to them in different columns.
My goal is to go through every row of dataframe A, and if a match for a keyword is found, it adds to this row the properties found in the dataframe B corresponding to the said keyword.
Dataframe A :
| index | col1 | col2 |
|---|---|---|
| 1 | bar | baz |
| 2 | foo01 | boo |
Dataframe B :
| index | col1 | col2 |
|---|---|---|
| foo | fooprop1 | fooprop2 |
| bim | bimprop1 | bimprop2 |
A match is found in Dataframe A, so it becomes :
Dataframe A :
| index | col1 | col2 | prop1 | prop2 |
|---|---|---|---|---|
| 1 | bar | baz | ||
| 2 | foo01 | boo | fooprop1 | fooprop2 |
I can figure out how to find matches for those keywords using regex. But i can't find a nifty way to do the other part, that is without using loops and all.
Thanks in advance if you have suggestions !
CodePudding user response:
This is my approach but you may change the part for similarity checking between strings of col1. I use fuzzywuzzy for similarity checking between strings and filter those who have score higher than 50:
import pandas as pd
import numpy as np
from fuzzywuzzy import process
df1 = pd.DataFrame({'col1': ['bar', 'foo01'],
'col2': ['baz', 'boo']})
df2 = pd.DataFrame({'col1': ['fooprop1', 'bimprop1'],
'col2': ['fooprop2', 'bimprop2']})
df2.columns = ['close_word', 'prop2']
# create a choice list
choices = df2['close_word'].values.tolist()
def find_closest(word):
close_word = process.extractOne(word, choices)
if close_word[1] > 50:
return close_word[0]
return np.nan
# apply fuzzywuzzy to each row using lambda expression
df1['close_word'] = df1['col1'].apply(lambda x: find_closest(x))
# merge
result = pd.merge(df1, df2, on="close_word", how="left")
print(result)
Output:
col1 col2 close_word prop2
0 bar baz NaN NaN
1 foo01 boo fooprop1 fooprop2
