I have a dataframe with two conditions. I want to count the rows from when condtion 1 is first time hit until condition 2 is hit, then start over again and wait for the next time cond1 gets hit again and sum up til cond2
What i want to sum is the the "change" column in that span of rows between the conditions.
cond 1 indexes are = 4, 8, 12, 16
cond 2 indexes are = 10, 16, 19, 23
To sum it up i want to sum the change column from index 4 to index 10 skipping index 8, then from 12 to 16, then 16 to 19 and so forth.
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
dfA = pd.DataFrame(randn(100,2),columns='A B'.split())
dfA['B'] = dfA['B'] 10
def analyse(df):
df = df.copy()
#df['change'] = 10 for sake for simplicity this is more logical to test
df['change'] = df.B.pct_change()
df['cond1'] = np.where(df['A'] < -1 ,1,0)
df['cond2'] = np.where(df['B'] > 11, 1,0)
df.dropna(inplace=True)
return df
print(analyse(dfA))
output sample:
A B change cond1 cond2
1 0.000357 9.565909 -0.105097 0 0
2 -0.360938 10.326678 0.079529 0 0
3 0.968598 9.536587 -0.076510 0 0
4 -1.144810 9.565296 0.003010 1 0
5 1.107228 10.781904 0.127190 0 0
...
CodePudding user response:
The logic for grouping is complicated enough so that we need to write a function to do it, so we start with that:
def groups_from_conditions(cond1, cond2):
'''
assign a unique non-NaN integer to each group as defined by the rules
'''
n = len(cond1)
group_idx = -1
groups = np.zeros(n)
curr_state = 0 # 0 = not in a group, 1 = in a group
for n in range(n):
if curr_state == 0:
# Currently not in a group
if cond1[n] == 1:
# Detected start of a group. so:
# switch the state to 1 ie in a group
curr_state = 1
# get a new group_idx
group_idx = group_idx 1
# assign it to the output for element n
groups[n] = group_idx
else:
# no start of the group detected, we are not in a group so mark as NaN
groups[n] = np.NaN
else:
# current_state == 1 so we are in a group
if cond2[n] == 1:
# detected end of group -- switch state to 0
curr_state = 0
# as we are in a group assign current group_idx. Note that this happens for the element
# for which cond2[n] == 1 as well, ie this element is included
groups[n] = group_idx
return groups
To see that it is doing the right thing let's apply it to the first 25 rows and observe the results
# calculate cond1 and cond2
df2 = analyse(dfA)
# Test our logic
test_cond1 = df2.loc[:25, 'cond1'].values
test_cond2 = df2.loc[:25, 'cond2'].values
test_groups = groups_from_conditions(test_cond1, test_cond2)
# stick in test_df for ease of printing
test_df = pd.DataFrame({'cond1' : test_cond1, 'cond2' : test_cond2, 'groups' : test_groups})
test_df
we obtain
cond1 cond2 groups
-- ------- ------- --------
0 0 0 nan
1 0 0 nan
2 0 0 nan
3 1 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 1 0 0
8 0 0 0
9 0 1 0
10 0 0 nan
11 1 0 1
12 0 0 1
13 0 0 1
14 0 0 1
15 1 1 1
16 0 0 nan
17 1 0 2
18 1 1 2
19 1 0 3
20 0 0 3
21 0 0 3
22 0 1 3
23 0 0 nan
24 0 0 nan
we see that a unique integer id (in groups) is assigned to each group and it seems to satisfy the requirements
Now we can just groupby on the output of our function and sum the right column
df2.groupby(groups_from_conditions(df2['cond1'].values, df2['cond2'].values))['change'].sum().to_frame()
output
change
-- ----------
0 0.21502
1 0.173852
2 0.138442
3 0.00701182
4 0.209301
5 0.175492
6 0.0209937
7 0.146636
8 0.290243
Edit -- better output
This also returns the start and the end index for each group:
df2.reset_index().groupby(groups_from_conditions(df2['cond1'].values, df2['cond2'].values)).agg({'change':sum, 'index':['first','last']})
output
change index
sum first last
0 0.215020 4 10
1 0.173852 12 16
2 0.138442 18 19
3 0.007012 20 23
4 0.209301 45 46
5 0.175492 50 56
6 0.020994 57 61
7 0.146636 62 71
8 0.290243 85 99
CodePudding user response:
It took me some time to understand the logic, but transform can do a good job here.
If would group the dataframe by blocs ending with a 1 in the cond2 column, and transform each bloc using an auxiliary function to give the result the sum of the changes column starting with the first 1 value in cond1.
The auxiliary function is:
def process(f):
x = df.loc[f.index, 'cond1'].cummax()
val = f[x > 0].sum()
return np.where(x > 0, val, 0)
We can now directly get the expected result with:
df = analyse(dfA)
df['changes_sum'] = df.groupby(df['cond2'].shift().fillna(0).cumsum()
)['change'].transform(process)
to get:
A B change cond1 cond2 changes_sum
1 0.000357 9.565909 -0.105097 0 0 0.000000
2 -0.360938 10.326678 0.079529 0 0 0.000000
3 0.968598 9.536587 -0.076510 0 0 0.000000
4 -1.144810 9.565296 0.003010 1 0 0.215020
5 1.107228 10.781904 0.127190 0 0 0.215020
6 0.113930 9.230490 -0.143891 0 0 0.215020
7 -0.186869 8.754208 -0.051599 0 0 0.215020
8 -2.016947 8.509005 -0.028010 1 0 0.215020
9 1.324123 10.621020 0.248209 0 0 0.215020
10 -0.021689 11.259448 0.060110 0 1 0.215020
11 -0.303271 9.845053 -0.125618 0 0 0.000000
12 -1.274105 9.730489 -0.011637 1 0 0.173852
13 -0.299103 9.541319 -0.019441 0 0 0.173852
14 -0.374649 10.755250 0.127229 0 0 0.173852
15 0.426690 10.450911 -0.028297 0 0 0.173852
16 -2.347033 11.558679 0.105997 1 1 0.173852
17 0.213736 9.951042 -0.139085 0 0 0.000000
18 -1.701469 10.251330 0.030176 1 0 0.138442
19 -1.485369 11.361190 0.108265 1 1 0.138442
20 -1.311128 10.514855 -0.074494 1 0 0.007012
21 1.158399 8.753032 -0.167556 0 0 0.007012
22 0.065564 9.997387 0.142163 0 0 0.007012
23 2.517158 11.066090 0.106898 0 1 0.007012
24 -0.388263 8.438005 -0.237490 0 0 0.000000
25 0.822394 8.924752 0.057685 0 0 0.000000
26 0.122491 10.624278 0.190428 0 0 0.000000
27 -0.882316 9.728327 -0.084330 0 0 0.000000
28 -0.245772 10.567543 0.086265 0 0 0.000000
29 0.172045 10.855364 0.027236 0 0 0.000000
30 1.709001 9.422214 -0.132022 0 0 0.000000
31 -0.280294 9.770575 0.036972 0 0 0.000000
32 -0.691827 9.080614 -0.070616 0 0 0.000000
33 -0.321549 10.590952 0.166326 0 0 0.000000
34 -0.155032 9.512700 -0.101809 0 0 0.000000
35 -0.052168 9.782063 0.028316 0 0 0.000000
36 0.951612 7.770738 -0.205614 0 0 0.000000
37 0.725763 11.399600 0.466991 0 1 0.000000
38 0.793190 10.718898 -0.059713 0 0 0.000000
39 2.080533 10.018371 -0.065354 0 0 0.000000
40 2.890031 11.147476 0.112703 0 1 0.000000
41 1.563451 9.316422 -0.164257 0 0 0.000000
42 -0.358973 8.371704 -0.101404 0 0 0.000000
43 -0.042927 10.443485 0.247474 0 0 0.000000
44 0.438877 9.734462 -0.067891 0 0 0.000000
45 -1.369854 10.522775 0.080982 1 0 0.209301
46 -1.260196 11.873045 0.128319 1 1 0.209301
47 -0.012729 9.267605 -0.219442 0 0 0.000000
48 0.603590 9.903244 0.068587 0 0 0.000000
49 -0.120610 10.041392 0.013950 0 0 0.000000
50 -1.372217 9.180103 -0.085774 1 0 0.175492
51 -0.068498 9.098150 -0.008927 0 0 0.175492
52 0.125410 9.040123 -0.006378 0 0 0.175492
53 0.874425 8.371506 -0.073961 0 0 0.175492
54 -0.442511 9.988561 0.193162 0 0 0.175492
55 -0.621700 9.558664 -0.043039 0 0 0.175492
56 -1.120992 11.474312 0.200410 1 1 0.175492
57 -1.458678 10.187339 -0.112161 1 0 0.020994
58 0.114353 9.922086 -0.026038 0 0 0.020994
59 -0.254377 8.413193 -0.152074 0 0 0.020994
60 -0.754028 10.927496 0.298852 0 0 0.020994
61 0.028300 11.063152 0.012414 0 1 0.020994
62 -1.768822 11.431818 0.033324 1 1 0.033324
63 0.742857 9.080574 -0.205675 0 0 0.000000
64 -1.040765 9.573166 0.054247 1 0 0.318988
65 -0.510999 10.168282 0.062165 0 0 0.318988
66 0.438022 10.404902 0.023270 0 0 0.318988
67 -0.359517 9.077677 -0.127558 0 0 0.318988
68 -0.730178 8.614026 -0.051076 0 0 0.318988
69 -0.449042 10.532863 0.222757 0 0 0.318988
70 1.003614 9.469792 -0.100929 0 0 0.318988
71 1.386934 11.705711 0.236111 0 1 0.318988
72 0.296448 11.640889 -0.005538 0 1 0.000000
73 -0.549103 11.163379 -0.041020 0 1 0.000000
74 -0.389459 10.764772 -0.035707 0 0 0.000000
75 -0.857943 8.109918 -0.246624 0 0 0.000000
76 -0.623137 10.847840 0.337602 0 0 0.000000
77 -0.397094 10.518781 -0.030334 0 0 0.000000
78 -0.874785 9.509236 -0.095975 0 0 0.000000
79 0.930555 9.490074 -0.002015 0 0 0.000000
80 -0.800944 8.843778 -0.068102 0 0 0.000000
81 -0.374924 9.377653 0.060367 0 0 0.000000
82 -0.509361 9.875481 0.053087 0 0 0.000000
83 -0.063740 10.099272 0.022661 0 0 0.000000
84 0.254932 8.432778 -0.165011 0 0 0.000000
85 -2.104728 8.578837 0.017320 1 0 0.290243
86 -0.355120 9.887398 0.152534 0 0 0.290243
87 -1.911956 9.824299 -0.006382 1 0 0.290243
88 -0.116750 10.173219 0.035516 0 0 0.290243
89 1.389123 10.281146 0.010609 0 0 0.290243
90 -0.441094 10.642809 0.035177 0 0 0.290243
91 0.686238 10.029521 -0.057625 0 0 0.290243
92 0.983907 8.541020 -0.148412 0 0 0.290243
93 -0.486963 9.527907 0.115547 0 0 0.290243
94 0.250544 10.741613 0.127384 0 0 0.290243
95 0.929991 9.916856 -0.076781 0 0 0.290243
96 0.674956 9.617204 -0.030216 0 0 0.290243
97 -0.533244 7.702358 -0.199106 0 0 0.290243
98 1.973100 10.727567 0.392764 0 0 0.290243
99 1.460717 9.889898 -0.078086 0 0 0.290243
