I have a table (df1) with a list of values (neig_list, which is a python list) in each row
| ID | neig_list |
|---|---|
| 1 | a, b, d |
| 2 | b, e, f, g, h |
| 3 | b, a, j, k |
And a table (df2) with entries for those values
| neig | samples | samples_indicator |
|---|---|---|
| 'a' | 3 | 0.5 |
| 'a' | 5 | 0.1 |
| 'b' | 1 | 0.2 |
| 'c' | 15 | 0.5 |
| 'd' | 12 | 0.3 |
| 'a' | 2 | 1 |
| 'e' | 5 | 0.6 |
| 'f' | 6 | 0 |
| 'h' | 6 | 0.5 |
I need to add a column to df1 getting, for each row, the result for the sum of samples x samples_indicator for all neigs that are contained in the neig_list for that row.
For example, for the first row, we would have:
3*0.5 5*0.1 1*0.2 12*0.3 2*1 = 7.8
| ID | neig_list | new_column |
|---|---|---|
| 1 | a, b, d | 7.8 |
| 2 | b, e, f, g, h | value |
| 3 | b, a, j, k | value |
Actually, the function is more complicated than that (involves more columns), so ideally I'd like to have a separate function and then apply it to df1, based on df2.
CodePudding user response:
Calculate first your math in df2:
map_ = df2.assign(neig = df2['neig'].str.strip("'"),
calculated = lambda df: df['samples'] * df['samples_indicator'])\
.groupby('neig')['calculated'].sum()
Then, explode your first df, and map the values above for 'a', 'b' etc with the calculated formula. Finally, groupby and sum:
df['new_column'] = df['neig_list'].str.split(', ').explode()\
.map(map_)\
.groupby(level=0)\
.sum()
ID neig_list new_column
0 1 a, b, d 7.8
1 2 b, e, f, g, h 6.2
2 3 b, a, j, k 4.2
CodePudding user response:
You can just define a function that performs the calculations for a given list of neigs using df2 and then just apply it to neig_list in df1:
def result(row):
return sum([df2['samples'][item]*df2['samples_indicator'][item] for item in row])
df1['new_column'] = df1['neig_list'].apply(result)
Note that this requires neig to be the index in df2. If it's not, you can do df2.set_index('neig', inplace=True) or, if you don't want to modify d2:
def result(row):
return sum([df2.set_index('neig')['samples'][item]*df2.set_index('neig')['samples_indicator'][item] for item in row])
and apply it the same way as before.
CodePudding user response:
Here's a way to do what your question asks:
def foo(df1, df2):
return (df1
.join(df1.assign(neig=df1.neig_list).explode('neig')
.join(
df2.assign(new_column=df2.samples * df2.samples_indicator)[['neig','new_column']].groupby('neig').sum(),
on='neig')
.drop(columns=['neig','neig_list']).groupby('ID').sum(),
on='ID')
)
print(foo(df1, df2))
Output:
ID neig_list new_column
0 1 [a, b, d] 7.8
1 2 [b, e, f, g, h] 6.2
2 3 [b, a, j, k] 4.2
Explanation:
- use
assign()to addnew_columnas a column todf2which, usinggroupby()andsum(), gets populated with the dot-product ofsamplesandsamples_indicatorfor the rows in eachneiggroup - use
assign()to clone theneig_listcolumn ofdf1asneigandexplode()to expand each row to one row per item in theneigcolumn - use
join()on the above two DataFrame objects to put sample results fromnew_columninto each row based on itsneigvalue - use
join()again with the above DataFrame object (after dropping theneigandneig_listcolumns) to add the desired column to the originaldf1.
