I have a dataframe: df1, B column values are list in a list
df1 = pd.DataFrame({'A': ['aa', 'bb'],
'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]],
'ADDED': [10, 20]})
and dataframe: df2
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
'number': [55, 66, 77, 88]})
I am trying to do string matching on column A and B
df3 = pd.DataFrame()
for i,val in enumerate(df1['A'].values):
for val2 in df1['B'][i]:
score = fuzz.partial_ratio(val, val2)
if score > 99:
df3 = df3.append(df2.loc[df2['id'] == val2[0], 'number'])
My result looks like this:
df3 = pd.DataFrame({'1': [66, 'NaN'], '3': ['NaN', 66]}, index=['number', 'number'])
Is there anyway to do this in an efficient way and the result would look like this: (added additional column from df1)
df4 = pd.DataFrame({'number': [66, 88], 'ADDED': [10, 20]})
CodePudding user response:
Using explode will allow you to look at each sub list without iteration over it. Then you can use the fuzzy match to filter the df1 and use join which defaults to joining on index to get your results.
from fuzzywuzzy import fuzz
df1 = pd.DataFrame({'A': ['aa', 'bb'],
'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]],
'ADDED': [10, 20]})
df2 = pd.DataFrame({'id': [1, 2, 3,4], 'number': [55, 66, 77, 88]})
df1 = df1.explode('B')
# Rowwise fuzzy match
df1 = df1[df1.apply(lambda x: fuzz.partial_ratio(x['A'], x['B'][1]) > 99, axis=1)]
df2[['number']].join(df1[['ADDED']], how='inner')
Output
number ADDED
0 55 10
1 66 20
CodePudding user response:
I think this code will work for me:
df1 = pd.DataFrame({'A': ['aa', 'bb'],
'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]],
'ADDED': [10, 20]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'number': [55, 66, 77, 88]})
df1 = df1.explode('B')
df1 = df1[df1.apply(lambda x: fuzz.partial_ratio(x['A'], x['B'][1]) > 99,
axis=1)]
df1['number'] = df1['B'].apply(lambda x: df2.loc[df2['id']==x[0],
'number'].values[0])
Output:
df1
A B ADDED number
0 aa [2, aa] 10 66
1 bb [4, bb] 20 88
Thank you, Chris.
