Home > Enterprise >  Pandas create new column base on groupby and apply lambda if statement
Pandas create new column base on groupby and apply lambda if statement

Time:01-18

I have the issue with groupby and apply

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})

I want to create a column C for each group take value 1 if B > z_score=2 and 0 otherwise. The code:

from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)

However, I am unsuccessful with code and cannot figure out the issue

CodePudding user response:

Use GroupBy.transformwith lambda, function, then compare and for convert True/False to 1/0 convert to integers:

from scipy import stats

s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)

Or use numpy.where:

df['C'] = np.where(s > 2, 1, 0)

Error in your solution is per groups:

from scipy import stats

df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

If check gotcha in pandas docs:

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.

So if use one of solutions instead if-else:

from scipy import stats

df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))

print (df)
A
a       [0, 0, 0]
b    [0, 0, 0, 0]
Name: B, dtype: object

but then need convert to column, for avoid this problems is used groupby.transform.

CodePudding user response:

You can use groupby apply a function that finds the z-scores of each item in each group; explode the resulting list; use gt to create a boolean series and convert it to dtype int

df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)

Output:

   A  B  C
0  a  1  0
1  a  2  0
2  a  3  0
3  b  4  0
4  b  5  0
5  b  6  0
6  b  7  0
  •  Tags:  
  • Related