really struggling with this pandas task for a research project.
I have a dataframe df that has two columns: time (a datetime column) and result (a boolean column). I want to count the streak of TRUE rows immediately preceding the current row with a look-back window of 7 days.
For example:
If the preceding row is false, then the count is 0
If the preceding row is true, then I want to know what the streak of true rows was within the 7 day period preceding that row.
Example of expected output below.
| time | result | DESIRED OUTPUT |
|---|---|---|
| 5/1/21 | TRUE | 0 (no preceding rows) |
| 5/6/21 | TRUE | 1 |
| 5/8/21 | FALSE | 2 (immediately preceded by streak of 2 TRUE rows in past 7 days) |
| 5/10/21 | FALSE | 0 |
| 5/11/21 | TRUE | 0 |
| 5/14/21 | TRUE | 1 (preceding row is TRUE) |
| 5/20/21 | TRUE | 1 (immediately preceded by streak of one TRUE rows in 1 week window) |
| 5/21/21 | TRUE | 2 (immediately preceded by streak of two TRUE rows in 1 week window) |
| 5/22/21 | TRUE | 2 (immediately preceded by streak of two TRUE rows in 1 week window) |
| 5/23/21 | FALSE | 3 (immediately preceded by streak of three TRUE rows in 1 week window) |
| 5/24/21 | TRUE | 0 (preceded by FALSE row) |
| 5/26/21 | TRUE | 1 (immediately preceded by streak of 1 TRUE row) |
I have been scouring Stack Overflow and racking my brain for days but just can't figure out a way to do this well. The trick with shift and groupby, e.g. df * (df.groupby((df != df.shift()).cumsum()).cumcount()) would work perfectly except it doesn't take into account the 7-day lookback window and the data is sampled irregularly, so I can't make assumptions on how many rows appear there will be in a 7-day period.
Thank you all so much for your time and help!
CodePudding user response:
I believe you were on the right track. This 
Please notice I converted your date column into a datetime index, which I think is required for this to work. You could always make a small, temporary dataframe to do this, if you don't want to convert the whole thing.
Using sum() works because summing a boolean column gives you the number of True values in it.
