Python Pandas - Applying function to a list of values for each row-CodePudding

I have a table (df1) with a list of values (neig_list, which is a python list) in each row

ID	neig_list
1	a, b, d
2	b, e, f, g, h
3	b, a, j, k

And a table (df2) with entries for those values

neig	samples	samples_indicator
'a'	3	0.5
'a'	5	0.1
'b'	1	0.2
'c'	15	0.5
'd'	12	0.3
'a'	2	1
'e'	5	0.6
'f'	6	0
'h'	6	0.5

I need to add a column to df1 getting, for each row, the result for the sum of samples x samples_indicator for all neigs that are contained in the neig_list for that row.

For example, for the first row, we would have:

3*0.5 5*0.1 1*0.2 12*0.3 2*1 = 7.8

ID	neig_list	new_column
1	a, b, d	7.8
2	b, e, f, g, h	value
3	b, a, j, k	value

Actually, the function is more complicated than that (involves more columns), so ideally I'd like to have a separate function and then apply it to df1, based on df2.

CodePudding user response：

Calculate first your math in df2:

map_ = df2.assign(neig = df2['neig'].str.strip("'"), 
                  calculated = lambda df: df['samples'] * df['samples_indicator'])\
          .groupby('neig')['calculated'].sum()

Then, explode your first df, and map the values above for 'a', 'b' etc with the calculated formula. Finally, groupby and sum:

df['new_column'] = df['neig_list'].str.split(', ').explode()\
                                  .map(map_)\
                                  .groupby(level=0)\  
                                  .sum()

   ID      neig_list  new_column
0   1        a, b, d         7.8
1   2  b, e, f, g, h         6.2
2   3     b, a, j, k         4.2

CodePudding user response：

You can just define a function that performs the calculations for a given list of neigs using df2 and then just apply it to neig_list in df1:

def result(row):
    return sum([df2['samples'][item]*df2['samples_indicator'][item] for item in row])

df1['new_column'] = df1['neig_list'].apply(result)

Note that this requires neig to be the index in df2. If it's not, you can do df2.set_index('neig', inplace=True) or, if you don't want to modify d2:

def result(row):
    return sum([df2.set_index('neig')['samples'][item]*df2.set_index('neig')['samples_indicator'][item] for item in row])

and apply it the same way as before.

CodePudding user response：

Here's a way to do what your question asks:

def foo(df1, df2):
    return (df1
        .join(df1.assign(neig=df1.neig_list).explode('neig')
            .join(
                df2.assign(new_column=df2.samples * df2.samples_indicator)[['neig','new_column']].groupby('neig').sum(), 
                on='neig')
                .drop(columns=['neig','neig_list']).groupby('ID').sum(), 
            on='ID')
        )
print(foo(df1, df2))

Output:

   ID        neig_list  new_column
0   1        [a, b, d]         7.8
1   2  [b, e, f, g, h]         6.2
2   3     [b, a, j, k]         4.2

Explanation:

use assign() to add new_column as a column to df2 which, using groupby() and sum(), gets populated with the dot-product of samples and samples_indicator for the rows in each neig group
use assign() to clone the neig_list column of df1 as neig and explode() to expand each row to one row per item in the neig column
use join() on the above two DataFrame objects to put sample results from new_column into each row based on its neig value
use join() again with the above DataFrame object (after dropping the neig and neig_list columns) to add the desired column to the original df1.

neig	samples	samples_indicator
'a'	3	0.5
'a'	5	0.1
'b'	1	0.2
'c'	15	0.5
'd'	12	0.3
'a'	2	1
'e'	5	0.6
'f'	6	0
'h'	6	0.5

neig	samples	samples_indicator
'a'	3	0.5
'a'	5	0.1
'b'	1	0.2
'c'	15	0.5
'd'	12	0.3
'a'	2	1
'e'	5	0.6
'f'	6	0
'h'	6	0.5

neig	samples	samples_indicator
'a'	3	0.5
'a'	5	0.1
'b'	1	0.2
'c'	15	0.5
'd'	12	0.3
'a'	2	1
'e'	5	0.6
'f'	6	0
'h'	6	0.5