Home > Blockchain >  Pandas `.assign` to multiple columns from single function
Pandas `.assign` to multiple columns from single function

Time:01-18

I have a Pandas dataframe which contains a single column val, as well as a function func which takes in a value and spits out a list of some fixed length (let's say 4). I also have a list cols of 4 strings. I would like to apply func to every cell and add 4 new columns, labelled according to my list.

What seems to work is something like this:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

df[cols] = df['val'].apply(lambda val: pd.Series(func(val)))

Since I see everyone recommend against using apply I wanted to try and do it with assign. I tried assigning the output of func to a temporary column tmp, and then extract the individual values one by one as follows:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

kwargs = {name: (lambda x: x.tmp[idx]) for idx, name in enumerate(cols)}
df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

But this throws some kind of error that I'm not sure how to interpret ValueError: Columns must be same length as key. Note that the documentation for .assign [1] says that this kind of self-reference is allowed, see the last example.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

EDIT: For the sake of clarification, in my actual application the call to func is sufficiently expensive that I don't want to call it four times per row. It's also not easily separable into four subcomponents like in my example.

CodePudding user response:

The reason is that you are using assign in the wrong way. When you use assign, you are creating a new dataframe which for each assign call adds a new column to the initial dataframe. This means this call here df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs) create a dataframe with 6 columns namely ["val", "tmp", "X", "Y", "Z", "hello"]. However, you try to assign it to only 4 columns when you do df[cols] = as cols as only 4 entries. This explains the error you have provided. What would work instead is either

df[["val", 'tmp', *cols]] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

or

df[cols] = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)[cols]
# Here you would lose the tmp column

or simply

df = df.assign(tmp=lambda x: pd.Series(func(x.val)), **kwargs)

CodePudding user response:

You'd have to do some testing to see if the original func is more performant over the assign method below.

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
df = df.assign(X=2**df['val'],
               Z=df['val'] 1,
               Y=df['val'].astype('str'),
               world='hello')

   val       X   Z   Y  world
0    1       2   2   1  hello
1    2       4   3   2  hello
2    4      16   5   4  hello
3   18  262144  19  18  hello
4    9     512  10   9  hello
5    1       2   2   1  hello

CodePudding user response:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 4, 18, 9, 1]})
cols = ["X", "Y", "Z", "hello"]
func = lambda x: [2**x, str(x), x 1, "world"]

df[cols] = df['val'].apply(lambda val: pd.Series(func(val)), result_type='expand')

I think the addition of expand will give the correct result.

  •  Tags:  
  • Related