How to use a vectorized operation based on column names?

Let's say I have a set-up like this

import pandas as pd 

def dummy(val1, val2):
    return val1 * val2 / 10


df = pd.DataFrame({'a': range(1, 3), 'b': range(2, 4), 'c': range(3, 5)})
d = {'a': 3, 'b': 10}

   a  b  c
0  1  2  3
1  2  3  4

Now I would like to apply dummy to the columns in df whose names exist as keys in d, and add the results as new columns. Here val1 refers to the values in the respective column and val2 to the value in d for the respective key.

I could do it like this

for k, v in d.items():
    # d[k] is of course just v; it's just to show that k is required for both input values
    df[f'{k}_calc'] = dummy(df[k], d[k])

which gives me the desired outcome

   a  b  c  a_calc  b_calc
0  1  2  3     0.3     2.0
1  2  3  4     0.6     3.0

Is there a more straightforward implementation available that avoids the loop?

CodePudding user response：

You can try something like this:

# Using @richardec's idea of taking the columns from the dictionary keys
cols = pd.Index(d.keys())
df[cols + '_calc'] = df[cols].apply(lambda x: dummy(x, d[x.name]))
df

Output:

   a  b  c  a_calc  b_calc
0  1  2  3     0.3     2.0
1  2  3  4     0.6     3.0

Details:

pd.DataFrame.apply passes each column of the dataframe into the lambda as a Series.
x.name is that column's header.
x and d[x.name] are then used as the inputs to your custom function.
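
To make the role of x.name concrete, here is a minimal sketch that records which label each call receives. The helper scaled and the list seen are hypothetical names added only for illustration; the arithmetic is the same as dummy from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 3), 'b': range(2, 4), 'c': range(3, 5)})
d = {'a': 3, 'b': 10}

seen = []  # hypothetical log of the column labels passed to the function

def scaled(col):
    # col is one column of the frame as a Series; col.name is its label
    seen.append(col.name)
    return col * d[col.name] / 10

# Only the columns named in d are selected, so 'c' is never passed in
result = df[list(d)].apply(scaled)
print(seen)
print(result)
```

Note that apply still calls the function once per selected column, so this is a tidier loop rather than a single vectorized operation.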

CodePudding user response：

Not really. Calling a custom function on each column inevitably leads to column-by-column execution one way or another. But you can do some things to make it more pandas-esque:

I thought this could be done with agg, but I don't think so, because agg can't pass different values to the aggregation functions depending on the column. So, using apply:

tmp = df[d.keys()].apply(lambda col: dummy(col, d[col.name])).add_suffix('_calc')
df = pd.concat([df, tmp], axis=1)

Output:

>>> df
   a  b  c  a_calc  b_calc
0  1  2  3     0.3     2.0
1  2  3  4     0.6     3.0

Note the use of add_suffix to add suffixes to the column names.
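
As a small illustration of what add_suffix does on its own, the sketch below uses a stand-in frame (tmp here is built by hand with the values from the output above, not computed):

```python
import pandas as pd

# Hand-built stand-in for the intermediate result of the apply call
tmp = pd.DataFrame({'a': [0.3, 0.6], 'b': [2.0, 3.0]})

# add_suffix returns a new frame with every column label extended;
# the original frame keeps its labels
renamed = tmp.add_suffix('_calc')
print(list(renamed.columns))
print(list(tmp.columns))
```

Because the renamed frame shares no column labels with df, concatenating the two along axis=1 does not produce duplicate column names.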

CodePudding user response：

I would convert the dictionary to a Series to benefit from broadcasting and index alignment:

s = pd.Series(d)

df.join((df.reindex(columns=s.index)*s/10).add_suffix('_calc'))

output:

   a  b  c  a_calc  b_calc
0  1  2  3     0.3     2.0
1  2  3  4     0.6     3.0
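
To show what index alignment means here, the following sketch multiplies the whole frame by the Series first and then restricts it to the shared columns. It also assumes that dummy stays plain arithmetic, in which case it can be called on the frame and the Series directly instead of repeating the formula.

```python
import pandas as pd

def dummy(val1, val2):
    return val1 * val2 / 10

df = pd.DataFrame({'a': range(1, 3), 'b': range(2, 4), 'c': range(3, 5)})
s = pd.Series({'a': 3, 'b': 10})

# DataFrame * Series lines up the Series index with the column labels,
# so each column is multiplied by its own factor
aligned = df * s
# 'c' has no label in s, so every value in that column comes out as NaN
print(aligned)

# Selecting the shared columns first keeps NaN out of the result;
# dummy receives a DataFrame and a Series and broadcasts the same way
calc = dummy(df[s.index], s).add_suffix('_calc')
out = df.join(calc)
print(out)
```

If dummy contained logic that only works on scalars or on a single Series, this shortcut would not apply and the apply-based versions would be the fallback.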

CodePudding user response：

Convert d to a Series object, multiply it element-wise with df, drop the all-NaN columns (which appear because not every column name in df has a corresponding key in d), and concatenate the result with df along axis=1:

out = pd.concat((df, (df*pd.Series(d)/10).dropna(axis='columns').add_suffix('_calc')), axis=1)

Output:

   a  b  c  a_calc  b_calc
0  1  2  3     0.3     2.0
1  2  3  4     0.6     3.0
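
One caveat worth checking, sketched below with made-up data: dropna(axis='columns') uses how='any' by default, so a column that is listed in d but has a missing value in df would be dropped as well. The NaN in column a is an assumption added for this example and is not part of the question's data.

```python
import numpy as np
import pandas as pd

# Same layout as the question, but with one missing value in 'a'
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2, 3], 'c': [3, 4]})
d = {'a': 3, 'b': 10}

prod = df * pd.Series(d) / 10

# Default how='any': 'a' is dropped along with 'c' because of its single NaN
any_dropped = prod.dropna(axis='columns')
print(list(any_dropped.columns))

# how='all' removes only columns that are entirely NaN, which here is just 'c'
all_dropped = prod.dropna(axis='columns', how='all')
print(list(all_dropped.columns))
```

With data that may contain missing values, how='all' or selecting the columns up front (for example df[list(d)]) is probably the safer choice.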