Home > Software design >  Pandas: Groupby and select records which IDs appear x times, where n < x < N
Pandas: Groupby and select records which IDs appear x times, where n < x < N

Time:01-28

In this dataframe I need to select records which UserID-s appear in the dataset x times, where 2 < x < 4:

d = {"UserId":[1,2,2,3,3,3,4,4,4,4],"review":["a","b","c","d","e","f","g","h","i","k"]}
f = pd.DataFrame(d)

UserId review
0       1      a
1       2      b
2       2      c
3       3      d
4       3      e
5       3      f
6       4      g
7       4      h
8       4      i
9       4      k 

Selecting records with one condition works:

f[f.groupby("UserId")["UserId"].transform('size') > 2]

    UserId  review
3   3   d
4   3   e
5   3   f
6   4   g
7   4   h
8   4   i
9   4   k

Can' solve it with number of UserID from interval. This does not work:

def check_size(x):
    return 2 < len(x) < 4

f['cnt'] = f.groupby('UserID')['UserID'].transform(check_size('size'))

Fails with:

...
...   
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
        860                 in_axis, level, gpr = False, gpr, None
        861             else:
    --> 862                 raise KeyError(gpr)
        863         elif isinstance(gpr, Grouper) and gpr.key is not None:
        864             # Add key to exclusions
    
    KeyError: 'UserID'

CodePudding user response:

Use between:

out = f[f.groupby("UserId")["UserId"].transform('size')
         .between(2, 4, inclusive='neither')]
print(out)

# Output
   UserId review
3       3      d
4       3      e
5       3      f

Update

How to add cnt column, so f['cnt'] has count of Id occurrences?

out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
       .loc[lambda x: x['cnt'].between(2, 4, inclusive='neither')]

# OR

out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
       .query("cnt.between(2, 4, inclusive='neither')")

Output:

>>> out
   UserId review  cnt
3       3      d    3
4       3      e    3
5       3      f    3

CodePudding user response:

Use between for selection between 2 values:

f[f.groupby('UserID')['UserID'].transform('size').between(3,5)]

Output:

   UserID review
3       3      d
4       3      e
5       3      f
6       4      g
7       4      h
8       4      i
9       4      k
  •  Tags:  
  • Related