I have this dataframe:
       type  run  corrected_episode  Reward
0  notsweet    0                  0    35.0
1  notsweet    0                100    20.0
2  notsweet    0                200    20.0
3  notsweet    0                300    22.0
4  notsweet    0                400    20.0
I want to create a new column, best_so_far, that holds the running maximum of Reward (so it never decreases), grouped by type, run, and corrected_episode. Easy enough, right? Except the following happens when I try to use groupby and cummax:
foo['best_so_far'] = foo.groupby(['type','run','corrected_episode']).Reward.cummax()
yields:
       type  run  corrected_episode  Reward  best_so_far
0  notsweet    0                  0    35.0         35.0
1  notsweet    0                100    20.0         20.0
2  notsweet    0                200    20.0         20.0
3  notsweet    0                300    22.0         22.0
4  notsweet    0                400    20.0         20.0
The "best so far", well, isn't the best. I get the same results if I use:
foo['best_so_far'] = foo.groupby(['type','run','corrected_episode']).Reward.apply(lambda x: x.cummax())
I know this is possible because I've done it dozens of times with other dataframes; there's just something weird about this one that keeps this simple procedure from working.
CodePudding user response:
You can try removing corrected_episode from the groupby keys:
foo['best_so_far'] = foo.groupby(['type','run']).Reward.cummax()
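To see why dropping the key helps, here is a minimal sketch that rebuilds the five rows from the question. The frame name foo comes from the question; group_sizes, per_row, and per_run are names made up for illustration. Every (type, run, corrected_episode) combination occurs exactly once in this sample, so each group holds a single row and its cumulative max is just that row's own Reward.

```python
import pandas as pd

# The five sample rows from the question.
foo = pd.DataFrame({
    "type": ["notsweet"] * 5,
    "run": [0] * 5,
    "corrected_episode": [0, 100, 200, 300, 400],
    "Reward": [35.0, 20.0, 20.0, 22.0, 20.0],
})

# Grouping by all three columns: one row per group in this sample.
group_sizes = foo.groupby(["type", "run", "corrected_episode"]).size()

# cummax over single-row groups simply echoes Reward.
per_row = foo.groupby(["type", "run", "corrected_episode"])["Reward"].cummax()

# Grouping by type and run only: the running max spans the whole run.
per_run = foo.groupby(["type", "run"])["Reward"].cummax()

print(group_sizes.max())
print(per_run.tolist())
```

Checking the group sizes first is a quick way to spot over-specified keys before blaming cummax.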
CodePudding user response:
After posting this, of course, I discovered what happened. I'm sharing what I did to fix it here because this is the kind of violation of the Principle of Least Astonishment that pandas is prone to.
The solution was to do this, instead:
foo['best_so_far'] = foo.groupby(['type','run']).Reward.cummax()
That is, I over-specified the grouping columns by including corrected_episode. Each (type, run, corrected_episode) combination occurs only once, so every group contained a single row and cummax() just returned that row's own Reward. I had originally included corrected_episode to ensure that the rows were in the correct order: the dataframe is the result of massaging a lot of data (you are seeing a teeny tiny subset), and the row order wasn't necessarily sane for cummax() to work as I envisioned. Grouping keys only define the groups, though; they do not sort the rows within a group.
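If row order is the real concern, one option is to sort explicitly before grouping. This is a sketch under assumptions, not necessarily what was done with the original data: the four shuffled rows below are invented for illustration, the name ordered is hypothetical, and the frame is assumed to have a unique index so the result aligns back to the right rows on assignment.

```python
import pandas as pd

# Invented rows, deliberately out of episode order.
foo = pd.DataFrame({
    "type": ["notsweet"] * 4,
    "run": [0] * 4,
    "corrected_episode": [200, 0, 300, 100],
    "Reward": [20.0, 10.0, 25.0, 30.0],
})

# Sort so that, within each (type, run), rows run in episode order.
ordered = foo.sort_values(["type", "run", "corrected_episode"])

# The running max is computed in sorted order; assignment aligns on the
# index, so foo itself keeps its original row order.
foo["best_so_far"] = ordered.groupby(["type", "run"])["Reward"].cummax()

print(foo.sort_values("corrected_episode"))
```

If the index contains duplicates, calling reset_index(drop=True) before sorting avoids ambiguous alignment.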
