How do you recalculate Standard Deviation at each row in a Dataframe?

Time:01-30

I'm trying to calculate the standard deviation of the 'Cost' column at each row, separately for each unique Client ID. Each row adds a new value, which changes the standard deviation from that row onward. I'm looking to generate the 'Standard Deviation' column shown below:

Client ID    Session  Cost    Standard Deviation
1            0        10      NaN
1            1        11      0.5000
1            2        14      1.6997
2            0        15      NaN
2            1        16      0.5000
2            2        14      0.8165
2            3        22      3.1122

CodePudding user response:

This would do the trick:

df["Standard Deviation"] = df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
   Client ID  Session  Cost  Standard Deviation
0          1        0    10                 NaN
1          1        1    11            0.500000
2          1        2    14            1.699673
3          2        0    15                 NaN
4          2        1    16            0.500000
5          2        2    14            0.816497
6          2        3    22            3.112475
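For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and `expanding(2)` is spelled out as `expanding(min_periods=2)`; otherwise it is the same one-liner.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Session": [0, 1, 2, 0, 1, 2, 3],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

# Expanding (cumulative) population standard deviation within each client.
# reset_index() replaces the (Client ID, row label) MultiIndex with 0..n-1,
# so this assignment assumes the rows are already sorted by Client ID.
df["Standard Deviation"] = (
    df.groupby("Client ID")["Cost"]
    .expanding(min_periods=2)
    .std(ddof=0)
    .reset_index()["Cost"]
)

print(df)
```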

Explanation

You can rephrase your problem as:

Finding the cumulative standard deviation of the "Cost" column grouped by the "Client ID" column.

Pandas conveniently has built-in functions that handle both cumulative and group by computations.

Group By

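A group by to compute the standard deviation looks like this (note that std() defaults to ddof=1, the sample standard deviation, so these numbers differ from the population values in the final result):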

df.groupby("Client ID")["Cost"].std()
Client ID
1    2.081666
2    3.593976
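To make the effect of ddof concrete, the sketch below computes both variants per client. The DataFrame is again rebuilt from the question's table, and the variable names are only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

grouped = df.groupby("Client ID")["Cost"]

# Default ddof=1: divide by (n - 1), the sample standard deviation.
sample_std = grouped.std()

# ddof=0: divide by n, the population standard deviation.
population_std = grouped.std(ddof=0)

print(pd.DataFrame({"ddof=1": sample_std, "ddof=0": population_std}))
```

The ddof=0 values for the full groups are the same as the last expanding value for each client in the final result, since the expanding window covers the whole group by then.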

Cumulative

The cumulative standard deviation can be computed with an expanding window. We use ddof=0 to get the population standard deviation, which is what the expected output in the question uses. We also set min_periods=2; otherwise the first row would have a value of 0.0 instead of NaN:

df.expanding(min_periods=2)["Cost"].std(ddof=0)
0         NaN
1    0.500000
2    1.699673
3    2.061553
4    2.315167
5    2.134375
6    3.619674
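As a sanity check on what "expanding" means here, the sketch below compares the pandas result against a hand-rolled loop that takes the population standard deviation of the first k values. The names `cost`, `expanding_std` and `manual` are only illustrative.

```python
import numpy as np
import pandas as pd

# The "Cost" column from the question, ignoring Client ID for now.
cost = pd.Series([10, 11, 14, 15, 16, 14, 22], name="Cost")

# pandas: expanding window, population std, NaN until two values are seen.
expanding_std = cost.expanding(min_periods=2).std(ddof=0)

# Hand-rolled equivalent: std of the first k values for k = 2..n,
# with NaN for the first row to mirror min_periods=2.
manual = [np.nan] + [
    float(np.std(cost.iloc[:k].to_numpy(), ddof=0))
    for k in range(2, len(cost) + 1)
]

print(np.allclose(expanding_std.to_numpy(), manual, equal_nan=True))
```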

Group By Cumulative

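Combining the two, we get our result. The grouped output is indexed by ("Client ID", original row label), so we reset the index to drop the group by indexing and get back a plain 0..n-1 index. Note that this lines up with the original rows only because df is already sorted by "Client ID":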

df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
0         NaN
1    0.500000
2    1.699673
3         NaN
4    0.500000
5    0.816497
6    3.112475
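If the rows are not sorted by "Client ID", a possible variation is to drop only the group level of the index and let pandas align on the original row labels. This is a sketch rather than part of the one-liner above, and the interleaved row order is made up for illustration.

```python
import pandas as pd

# Same values as the question, but with the two clients interleaved
# (hypothetical ordering, to show the alignment issue).
df = pd.DataFrame({
    "Client ID": [2, 1, 2, 1, 2, 1, 2],
    "Session": [0, 0, 1, 1, 2, 2, 3],
    "Cost": [15, 10, 16, 11, 14, 14, 22],
})

expanding_std = (
    df.groupby("Client ID")["Cost"]
    .expanding(min_periods=2)
    .std(ddof=0)
)

# Drop only the "Client ID" level; what remains is the original row label,
# so the assignment aligns each value with the row it was computed for.
df["Standard Deviation"] = expanding_std.reset_index(level=0, drop=True)

print(df)
```

With `reset_index()["Cost"]` the values would instead come back in group order and be assigned by position, which puts them on the wrong rows for data like this.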