I have the following dataframe
import pandas as pd
import numpy as np
import datetime
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21),
datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25),
datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
'use': [None, None, None,
datetime.date(2020,8,30), None, datetime.date(2021,3,4),
datetime.date(2020,7,8), datetime.date(2021,12,20), None]})
I would like to create a rank column, which will be grouped by id and will be ascending first by buy and then by use. If the use is None and the buy is the same, then the same rank should be given.
I tried:
foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby("id")[["buy", "use"]].rank(method="dense", ascending=True)
The above code works as expected for all ids except id=c
Any ideas why ?
UPDATE
The expected output should be:
import pandas as pd
import numpy as np
import datetime
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21),
datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25),
datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
'use': [None, None, None,
datetime.date(2020,8,30), None, datetime.date(2021,3,4),
datetime.date(2020,7,8), datetime.date(2021,12,20), None],
'rank': [1,1,2,1,1,2,1,2,1]})
CodePudding user response:
If I understood correctly your comments, within an id/buy group, any None makes the values have the same rank.
Thus, a first step is to mask all use values if there is any None within this group. Then, to rank on several columns you need to aggregate as tuple:
foo['rank'] = (foo
.assign(use2=foo.groupby(['id', 'buy'])['use']
.transform(lambda s: '' if s.isna().any() else s)
)
[['buy', 'use2']].apply(tuple, axis=1)
.groupby(foo['id']).rank(method='dense', na_option='top')
)
output:
id buy use rank
0 a 2020-04-10 None 1
1 a 2020-04-10 None 1
2 a 2020-05-21 None 2
3 b 2020-08-28 2020-08-30 1
4 b 2020-08-28 None 1
5 b 2021-02-25 2021-03-04 2
6 c 2021-12-01 2020-07-08 1
7 c 2021-12-01 2021-12-20 2
8 d 2021-12-17 None 1
CodePudding user response:
Not sure if there is a more efficient way, but this works:
foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby(["id"])[["buy", "use"]].rank(method="dense", ascending=True)
foo["rank2"] = foo.groupby(["id"])[["buy", "use"]].rank(method="first", ascending=True)
foo = foo.merge(foo.groupby('id')['use'].apply(lambda x: x.isnull().any()).to_frame('BOOL').reset_index(), on='id')
foo['rank_f'] = np.where(foo['BOOL'], foo['rank'], foo['rank2'])
And rank_f is the correct column
