Pandas `.assign` to multiple columns from single function-CodePudding

I have a Pandas dataframe which contains a single column val, as well as a function func which takes in a value and spits out a list of some fixed length (let's say 4). I also have a list cols of 4 strings. I would like to apply func to every cell and add 4 new columns, labelled according to my list.

What seems to work is something like this:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

df[cols] = df['val'].apply(lambda val: pd.Series(func(val)))

Since I see everyone recommend against using apply I wanted to try and do it with assign. I tried assigning the output of func to a temporary column tmp, and then extract the individual values one by one as follows:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

kwargs = {name: (lambda x: x.tmp[idx]) for idx, name in enumerate(cols)}
df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

But this throws some kind of error that I'm not sure how to interpret ValueError: Columns must be same length as key. Note that the documentation for .assign [1] says that this kind of self-reference is allowed, see the last example.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

EDIT: For the sake of clarification, in my actual application the call to func is sufficiently expensive that I don't want to call it four times per row. It's also not easily separable into four subcomponents like in my example.

CodePudding user response：

The reason is that you are using assign in the wrong way. When you use assign, you are creating a new dataframe which for each assign call adds a new column to the initial dataframe. This means this call here df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs) create a dataframe with 6 columns namely ["val", "tmp", "X", "Y", "Z", "hello"]. However, you try to assign it to only 4 columns when you do df[cols] = as cols as only 4 entries. This explains the error you have provided. What would work instead is either

df[["val", 'tmp', *cols]] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)[cols]
# Here you would lose the tmp column

or simply

df = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

CodePudding user response：

You'd have to do some testing to see if the original func is more performant over the assign method below.

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
df = df.assign(X=2**df['val'],
               Z=df['val'] 1,
               Y=df['val'].astype('str'),
               world='hello')

   val       X   Z   Y  world
0    1       2   2   1  hello
1    2       4   3   2  hello
2    4      16   5   4  hello
3   18  262144  19  18  hello
4    9     512  10   9  hello
5    1       2   2   1  hello

CodePudding user response：

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

df[cols] = df['val'].apply(lambda val: pd.Series(func(val)), result_type='expand')

I think the addition of expand will give the correct result.