I have a pandas DataFrame called "DF". Given a row count N (say N = 100) and a column (say "Type"), I would like to sample N rows from the population so that each value of "Type" occurs equally often in the sample.
| SNo | Type | Difficulty |
|---|---|---|
| 1 | Single | 5 |
| 2 | Single | 15 |
| 3 | Single | 4 |
| 4 | Multiple | 2 |
| 5 | Multiple | 14 |
| 6 | None | 7 |
| 7 | None | 4323 |
For instance, if I specify N = 3, the output should be:
| SNo | Type | Difficulty |
|---|---|---|
| 1 | Single | 5 |
| 4 | Multiple | 2 |
| 6 | None | 7 |
If some type has too few rows to fill its equal share of N, the shortfall can be made up by randomly increasing the count of another type.
I am wondering how to approach this programmatically. Thanks!
CodePudding user response:
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB. This assumes that N is a multiple of the number of types if you want strictly equal counts.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
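For a self-contained run of the snippet above, here is a minimal sketch that rebuilds the question's table by hand. The `per_type` name and the `random_state` value are illustrative choices, not part of the original answer; `random_state` is only there to make the draw repeatable.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

N = 3
per_type = N // df["Type"].nunique()  # 3 distinct types, so 1 row per type

# Draw per_type random rows from each group (needs pandas >= 1.1).
out = df.groupby("Type").sample(n=per_type, random_state=0)

# Each type should appear exactly per_type times.
print(out["Type"].value_counts())
```

Which rows are picked depends on the seed, but every type should show up the same number of times.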
Handling an N that is not a multiple of the number of types
Use the same approach as above, then top up to N with random rows, excluding those already selected.
import pandas as pd
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
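The same steps can be wrapped in a small function. This is only a sketch: `sample_balanced` and its `seed` parameter are hypothetical names, and it assumes the DataFrame index is unique, since `drop(out.index)` removes rows by index label.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

def sample_balanced(frame, column, n, seed=None):
    """Hypothetical helper: equal share per group, then random top-up to n rows."""
    share, remainder = divmod(n, frame[column].nunique())
    # Equal share from every group.
    out = frame.groupby(column).sample(n=share, random_state=seed)
    # Fill the remainder from rows that were not picked yet (assumes a unique index).
    extra = frame.drop(out.index).sample(n=remainder, random_state=seed)
    return pd.concat([out, extra])

out = sample_balanced(df, "Type", 5, seed=0)
counts = out["Type"].value_counts()
print(counts)
```

With N = 5 and three types, each type gets one row and the two extra rows are drawn from whatever is left, so the counts are no longer strictly equal.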
The random top-up can still draw several rows from the same group. If you really want the extra rows to come from different groups, replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
| | SNo | Type | Difficulty |
|---|---|---|---|
| 4 | 5 | Multiple | 14 |
| 6 | 7 | None | 4323 |
| 2 | 3 | Single | 4 |
| 3 | 4 | Multiple | 2 |
| 5 | 6 | None | 7 |
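The question also asks what to do when a type has fewer rows than its equal share. In that case `groupby('Type').sample(n=N2)` raises a `ValueError`, because it cannot draw more rows than a group holds without replacement. One possible workaround, sketched below, swaps in a different technique: shuffle the whole frame once, take up to `share` rows per group with `groupby.head`, then top up from the leftover rows. The data, the `sample_capped` helper and its `seed` parameter are all hypothetical and not part of the original answer; a unique index is assumed.

```python
import pandas as pd

# Hypothetical data (not from the question): "None" has only one row.
df = pd.DataFrame({
    "SNo": range(1, 9),
    "Type": ["Single"] * 5 + ["Multiple"] * 2 + ["None"],
})

def sample_capped(frame, column, n, seed=None):
    """Hypothetical helper: up to an equal share per group, then top up to n rows."""
    share = n // frame[column].nunique()
    # Shuffle once so that "first rows" of each group are a random pick.
    shuffled = frame.sample(frac=1, random_state=seed)
    # head(share) returns at most `share` rows per group, fewer if the group is small.
    out = shuffled.groupby(column).head(share)
    # Make up the shortfall from rows not selected yet (assumes a unique index).
    extra = shuffled.drop(out.index).head(n - len(out))
    return pd.concat([out, extra])

out = sample_capped(df, "Type", 6, seed=0)
counts = out["Type"].value_counts()
print(counts)  # Single 3, Multiple 2, None 1
```

Here the equal share is 2, "None" can only supply 1 row, and the missing row is taken from the only type with rows left over ("Single").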
