Home > database >  How to stop and restart cumsum using a marker in another column
How to stop and restart cumsum using a marker in another column

Time:01-25

I have a pandas dataframe with values that needs to be totalized inside a period for each device, but the periods ends are marked in another column (an easy identifiable event line). The cumsum must go until it finds this end of period marker and then start again from zero(the first value on the next line).

  device_name value end
0   A5         1    False
1   A5         7    False
2   A5         2    True
3   A5         1    False
4   A5         1    False
5   A5         1    False
6   A5         1    True
7   A6         2    False
8   A6         4    False
9   A6         2    False
10  A6         2    True
11  A6         2    False
12  A6         2    False

the result should be something like

  device_name value end     total
0   A5         1    False    1  
1   A5         7    False    8
2   A5         2    True     10
3   A5         1    False    1
4   A5         1    False    2
5   A5         1    False    3
6   A5         1    True     4
7   A6         2    False    2
8   A6         4    False    6
9   A6         2    False    8
10  A6         2    True     10
11  A6         2    False    2
12  A6         2    False    4

I tried cumsum but I can't separate the end of a period from the other, I can iterate over rows and if i find a True on df.end.shift(1) i ignore the last value, but there are at least 60000 and probably there are faster methods, Any suggestion?

CodePudding user response:

IIUC, you want to cumsum per group until you reach a True. Then, after this row, restart the count.

You can use an extra group based on the "end" value (also using a cumsum):

df['total'] = (df.groupby(['device_name',
                           df['end'].shift(1, fill_value=0).cumsum()])
               ['value'].cumsum())

output:

   device_name  value    end  total
0           A5      1  False      1
1           A5      7  False      8
2           A5      2   True     10
3           A5      1  False      1
4           A5      1  False      2
5           A5      1  False      3
6           A5      1   True      4
7           A6      2  False      2
8           A6      4  False      6
9           A6      2  False      8
10          A6      2   True     10
11          A6      2  False      2
12          A6      2  False      4

NB. note that I get a different value for row #2

NB.2. for purists, the extra group could also be computed using a groupby. It doesn't really matter in this case. The internal groups will just not start from zero after the first group, but their name will not be used anywhere in the output.

CodePudding user response:

cnt = 0
def fetch_flag_col(x):
    global cnt
    resp = cnt
    if(x): cnt =1
    return resp

df["flag"] = df["end"].apply(fetch_flag_col)
df["total"] = df.groupby(["device_name ","flag"])["Value"].cumsum()

I am not sure if there is a direct function or a sophisticated way to do this using native pandas api's, but using the above method you can achieve the desired result in O(n) time complexity.

What we are doing is, creating a intermediate column called flag, which will help .cumsum(), method decide between what range of rows, it has to do the cumulative.

  •  Tags:  
  • Related