Pandas - Equal occurrences of unique type for a column

I have a pandas DataFrame called “DF”. Given an occurrence count N = 100 and the column "Type", I would like to sample a total of 100 rows from the population so that every type occurs an equal number of times.

SNo	Type	Difficulty
1	Single	5
2	Single	15
3	Single	4
4	Multiple	2
5	Multiple	14
6	None	7
7	None	4323

For instance, if I specify N = 3, the output must be:

SNo	Type	Difficulty
1	Single	5
4	Multiple	2
6	None	7

If, for a given N, some types do not have enough rows to fill their equal share, the count of another type can be increased at random to make up the difference.

I am wondering how to approach this programmatically. Thanks!

CodePudding user response:

Use groupby.sample (pandas ≥ 1.1) with n set to N divided by the number of types.

NB. Strict equality requires N to be a multiple of the number of types, and every type must have at least that many rows (otherwise sample raises a ValueError).

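N = 3
# rows to draw per type; floor division drops any remainder
N2 = N // df['Type'].nunique()

# draw N2 random rows from each 'Type' group
out = df.groupby('Type').sample(n=N2)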
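
For reference, here is a minimal self-contained sketch of this step on the question's sample table. The DataFrame construction, N = 6 and random_state=0 are assumptions made here so that the run is reproducible; they are not part of the original answer.

```python
import pandas as pd

# Sample table from the question ("None" is an ordinary string label here)
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

N = 6                                   # assumed: a multiple of the 3 types
per_type = N // df["Type"].nunique()    # 6 // 3 = 2

# random_state is only set to make the draw repeatable
out = df.groupby("Type").sample(n=per_type, random_state=0)

counts = out["Type"].value_counts().sort_index().to_dict()
print(counts)  # {'Multiple': 2, 'None': 2, 'Single': 2}
```

Which rows are picked depends on the seed, but the per-type counts are always equal as long as every type has at least per_type rows.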

Handling N that is not a multiple of the number of types

Sample as above, then fill up to N with random rows drawn from those not already selected.

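import pandas as pd

N = 5
# N2 rows per type, plus R leftover rows needed to reach N
N2, R = divmod(N, df['Type'].nunique())

out = df.groupby('Type').sample(n=N2)

# top up with R random rows that were not already picked
out = pd.concat([out, df.drop(out.index).sample(n=R)])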
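
The same two steps, written out as a runnable sketch on the question's table. The variable names per_type, remainder, base and extra, and the fixed random_state, are choices made for this illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

N = 5
per_type, remainder = divmod(N, df["Type"].nunique())  # divmod(5, 3) -> (1, 2)

# Step 1: equal share from every type
base = df.groupby("Type").sample(n=per_type, random_state=0)

# Step 2: fill up to N from the rows that were not picked in step 1.
# drop() works on index labels, so this assumes df has a unique index.
extra = df.drop(base.index).sample(n=remainder, random_state=0)

out = pd.concat([base, extra])
print(len(out), out.index.is_unique)
```

Because step 2 ignores the type, both extra rows may land in the same group, so the final counts can differ by more than one.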

The rows added in this last step can still all come from the same group. If you really want to ensure that they come from different groups, replace the last step with:

out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])

Example output:

   SNo      Type  Difficulty
4    5  Multiple          14
6    7      None        4323
2    3    Single           4
3    4  Multiple           2
5    6      None           7
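
The question also mentions types that do not have enough rows for their equal share, a case in which groupby.sample(n=N2) raises a ValueError. One possible way to cover it is a round-robin draw, which is a different technique from the answer above: shuffle the rows, number them within each type with cumcount, and take the lowest numbers first. The helper name sample_balanced and the small demo table are hypothetical, and the sketch assumes the DataFrame has a unique index.

```python
import pandas as pd

def sample_balanced(df, column, n, seed=None):
    """Hypothetical helper: draw n rows, keeping per-type counts as even as the data allows."""
    if n > len(df):
        raise ValueError("n is larger than the number of rows")
    # Shuffle once so the order of rows inside each type is random
    shuffled = df.sample(frac=1, random_state=seed)
    # 0 for the first row seen of a type, 1 for the second, and so on
    rank = shuffled.groupby(column).cumcount()
    # Stable sort by rank: one row of every type, then a second of every type, ...
    # A type that has run out of rows is simply skipped in later rounds.
    order = rank.sort_values(kind="mergesort").index
    return shuffled.loc[order].head(n)

# Made-up demo data: type B has a single row, fewer than its share of 6 // 3 = 2
demo = pd.DataFrame({
    "Type": ["A"] * 5 + ["B"] + ["C"] * 4,
    "Difficulty": range(10),
})

out = sample_balanced(demo, "Type", n=6, seed=0)
print(out["Type"].value_counts().sort_index())
```

Here B contributes its only row, and the row it could not supply goes to either A or C, depending on the shuffle.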