I have a Pandas dataframe which contains a single column val, as well as a function func which takes in a value and spits out a list of some fixed length (let's say 4). I also have a list cols of 4 strings. I would like to apply func to every cell and add 4 new columns, labelled according to my list.
What seems to work is something like this:
import pandas as pd
df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]
df[cols] = df['val'].apply(lambda val: pd.Series(func(val)))
Since I see everyone recommend against using apply I wanted to try and do it with assign. I tried assigning the output of func to a temporary column tmp, and then extract the individual values one by one as follows:
import pandas as pd
df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]
kwargs = {name: (lambda x: x.tmp[idx]) for idx, name in enumerate(cols)}
df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)
But this throws some kind of error that I'm not sure how to interpret ValueError: Columns must be same length as key. Note that the documentation for .assign [1] says that this kind of self-reference is allowed, see the last example.
[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
EDIT: For the sake of clarification, in my actual application the call to func is sufficiently expensive that I don't want to call it four times per row. It's also not easily separable into four subcomponents like in my example.
CodePudding user response:
The reason is that you are using assign in the wrong way. When you use assign, you are creating a new dataframe which for each assign call adds a new column to the initial dataframe. This means this call here df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs) create a dataframe with 6 columns namely ["val", "tmp", "X", "Y", "Z", "hello"]. However, you try to assign it to only 4 columns when you do df[cols] = as cols as only 4 entries. This explains the error you have provided. What would work instead is either
df[["val", 'tmp', *cols]] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)
or
df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)[cols]
# Here you would lose the tmp column
or simply
df = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)
CodePudding user response:
You'd have to do some testing to see if the original func is more performant over the assign method below.
df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
df = df.assign(X=2**df['val'],
Z=df['val'] 1,
Y=df['val'].astype('str'),
world='hello')
val X Z Y world
0 1 2 2 1 hello
1 2 4 3 2 hello
2 4 16 5 4 hello
3 18 262144 19 18 hello
4 9 512 10 9 hello
5 1 2 2 1 hello
CodePudding user response:
import pandas as pd
df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]
df[cols] = df['val'].apply(lambda val: pd.Series(func(val)), result_type='expand')
I think the addition of expand will give the correct result.
