Pandas: Adding a new column and values with a for loop-CodePudding

I have a dataframe: df1, B column values are list in a list

df1 = pd.DataFrame({'A': ['aa', 'bb'], 
                'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]], 
                'ADDED': [10, 20]})

and dataframe: df2

df2 =  pd.DataFrame({'id': [1, 2, 3, 4], 
                     'number': [55, 66, 77, 88]})

I am trying to do string matching on column A and B

df3 = pd.DataFrame()
for i,val in enumerate(df1['A'].values):
   for val2 in df1['B'][i]:
       score = fuzz.partial_ratio(val, val2)
       if score > 99:
           df3 = df3.append(df2.loc[df2['id'] == val2[0], 'number'])

My result looks like this:

df3 = pd.DataFrame({'1': [66, 'NaN'], '3': ['NaN', 66]}, index=['number', 'number'])

Is there anyway to do this in an efficient way and the result would look like this: (added additional column from df1)

df4 = pd.DataFrame({'number': [66, 88], 'ADDED': [10, 20]})

CodePudding user response：

Using explode will allow you to look at each sub list without iteration over it. Then you can use the fuzzy match to filter the df1 and use join which defaults to joining on index to get your results.

from fuzzywuzzy import fuzz
df1 = pd.DataFrame({'A': ['aa', 'bb'], 
                'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]], 
                'ADDED': [10, 20]})

df2 =  pd.DataFrame({'id': [1, 2, 3,4], 'number': [55, 66, 77, 88]})

df1 = df1.explode('B')
# Rowwise fuzzy match
df1 = df1[df1.apply(lambda x: fuzz.partial_ratio(x['A'], x['B'][1]) > 99, axis=1)]

df2[['number']].join(df1[['ADDED']], how='inner')

Output

    number  ADDED
0       55     10
1       66     20

CodePudding user response：

I think this code will work for me:

df1 = pd.DataFrame({'A': ['aa', 'bb'], 
            'B': [[[1,'cc'],[2,'aa']],[[3,'dd'],[4,'bb']]], 
            'ADDED': [10, 20]})
df2 =  pd.DataFrame({'id': [1, 2, 3, 4], 'number': [55, 66, 77, 88]})
df1 = df1.explode('B')
df1 = df1[df1.apply(lambda x: fuzz.partial_ratio(x['A'], x['B'][1]) > 99, 
       axis=1)]
df1['number'] = df1['B'].apply(lambda x: df2.loc[df2['id']==x[0], 
                'number'].values[0])

Output:

df1
    A        B  ADDED  number
0  aa  [2, aa]     10      66
1  bb  [4, bb]     20      88

Thank you, Chris.