In this dataframe I need to select records which UserID-s appear in the dataset x times, where 2 < x < 4:
d = {"UserId":[1,2,2,3,3,3,4,4,4,4],"review":["a","b","c","d","e","f","g","h","i","k"]}
f = pd.DataFrame(d)
UserId review
0 1 a
1 2 b
2 2 c
3 3 d
4 3 e
5 3 f
6 4 g
7 4 h
8 4 i
9 4 k
Selecting records with one condition works:
f[f.groupby("UserId")["UserId"].transform('size') > 2]
UserId review
3 3 d
4 3 e
5 3 f
6 4 g
7 4 h
8 4 i
9 4 k
Can' solve it with number of UserID from interval. This does not work:
def check_size(x):
return 2 < len(x) < 4
f['cnt'] = f.groupby('UserID')['UserID'].transform(check_size('size'))
Fails with:
...
...
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
860 in_axis, level, gpr = False, gpr, None
861 else:
--> 862 raise KeyError(gpr)
863 elif isinstance(gpr, Grouper) and gpr.key is not None:
864 # Add key to exclusions
KeyError: 'UserID'
CodePudding user response:
Use between:
out = f[f.groupby("UserId")["UserId"].transform('size')
.between(2, 4, inclusive='neither')]
print(out)
# Output
UserId review
3 3 d
4 3 e
5 3 f
Update
How to add cnt column, so f['cnt'] has count of Id occurrences?
out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
.loc[lambda x: x['cnt'].between(2, 4, inclusive='neither')]
# OR
out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
.query("cnt.between(2, 4, inclusive='neither')")
Output:
>>> out
UserId review cnt
3 3 d 3
4 3 e 3
5 3 f 3
CodePudding user response:
Use between for selection between 2 values:
f[f.groupby('UserID')['UserID'].transform('size').between(3,5)]
Output:
UserID review
3 3 d
4 3 e
5 3 f
6 4 g
7 4 h
8 4 i
9 4 k
