In Pandas I am trying create a column that counts all previous rows based on two column values.
The idea is to sort the data by animal and by date and then use pandas shift function to calculate the variance between an animals current date and previous date(the sample data below already has this done).
Then I want to create a new column called Previous_Count that is a rolling count of previous records based on value in Animal column and resets if the Date_Variance field is larger than 10.
Sample Data Code:
import pandas as pd
data = {'Date': ['2021-01-01','2021-01-05','2021-01-10','2021-01-11','2021-01-04','2021-01-08','2021-01-20','2021-01-21','2021-01-25','2021-02-12','2021-02-14'],
'Date_Variance': [0,4,5,1,0,4,0,1,4,18,2],
'Animal': ['Dog','Dog','Dog','Dog','Cat','Cat','Lizard','Lizard','Lizard','Lizard','Lizard'],
#'Previous_count': [0,1,2,3,0,1,0,1,2,0,1]
}
df = pd.DataFrame(data)
Sample Data Output with Previous_Count column:
Date Date_Variance Animal Previous_count
1/1/2021 0 Dog 0
1/5/2021 4 Dog 1
1/10/2021 5 Dog 2
1/11/2021 1 Dog 3
1/4/2021 0 Cat 0
1/8/2021 4 Cat 1
1/20/2021 0 Lizard 0
1/21/2021 1 Lizard 1
1/25/2021 4 Lizard 2
2/12/2021 18 Lizard 0
2/14/2021 2 Lizard 1
CodePudding user response:
Try:
grouper = df.groupby("Animal")["Date_Variance"].apply(lambda x: x.gt(10).cumsum())
df["Previous_count"] = df.groupby(["Animal", grouper]).cumcount()
>>> df
Date Date_Variance Animal Previous_count
0 2021-01-01 0 Dog 0
1 2021-01-05 4 Dog 1
2 2021-01-10 5 Dog 2
3 2021-01-11 1 Dog 3
4 2021-01-04 0 Cat 0
5 2021-01-08 4 Cat 1
6 2021-01-20 0 Lizard 0
7 2021-01-21 1 Lizard 1
8 2021-01-25 4 Lizard 2
9 2021-02-12 18 Lizard 0
10 2021-02-14 2 Lizard 1
CodePudding user response:
Use GroupBy.cumcount. Here we use Series.cumsum to get a new group whenever Date_Variance is greater than 10
df['Previous_count'] = df.groupby(['Animal',
df['Date_Variance'].gt(10).cumsum()]).cumcount()
print(df)
Date Date_Variance Animal Previous_count
0 2021-01-01 0 Dog 0
1 2021-01-05 4 Dog 1
2 2021-01-10 5 Dog 2
3 2021-01-11 1 Dog 3
4 2021-01-04 0 Cat 0
5 2021-01-08 4 Cat 1
6 2021-01-20 0 Lizard 0
7 2021-01-21 1 Lizard 1
8 2021-01-25 4 Lizard 2
9 2021-02-12 18 Lizard 0
10 2021-02-14 2 Lizard 1
print(df['Date_Variance'].gt(10).cumsum())
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 1
Name: Date_Variance, dtype: int64
