I need to sort my df by month with the cumulative sum for each user (about 5 or 6). Each line is a different log entry by the user, so users may have multiple entries on the same day.
USER DATE
1 user1 2021-04-15
4 user5 2021-04-15
5 user3 2021-04-15
6 user1 2021-04-15
14 user2 2021-04-16
... ... ...
2227 user4 2021-12-30
2228 user5 2021-12-30
2229 user3 2021-12-30
2230 user2 2021-12-30
2231 user1 2021-12-30
I would like to get something like this
MONTH USER CUMSUM
1 2021-04 user1 3
2 2021-04 user2 5
3 2021-04 user3 2
4 2021-04 user4 0
5 2021-04 user5 1
... ... ... ...
n 2021-12 user1 232
n 1 2021-12 user2 124
n 2 2021-12 user3 152
n 3 2021-12 user4 312
n 4 2021-12 user5 218
The objective is to later graph the cumulative sum by month for each user. I have a code that is already working but had to iterate on the df and count each entry for each month on a dict. Probably not the most efficient way. I tried using cumsum and groupby but so far without success.
CodePudding user response:
how about this one-liner :
df.groupby([pd.Grouper(key='DATE', freq='M'), 'USER'])['USER'].count().groupby(['USER']).cumsum()
CodePudding user response:
You can use pandas Grouper or more typically written pd.Grouper for the month, but you have to set an index if you don't already have one.
df.set_index('DATE').groupby([pd.Grouper(freq = 'M'),'USER']).sum()
CodePudding user response:
df['MONTH'] = pd.to_datetime(df['DATE'], format='%Y-%m-%d') # getting DATE to datetime
df['MONTH'] = df['MONTH'].apply(lambda x: x.strftime("%Y-%m")) # applying your format
df['count'] = 1 # adding a count column for cumsum()
df_try = df.groupby(['USER', 'MONTH']).sum().groupby(level=0).cumsum() # groupby and cumsum
