Sampling without replacement with unequal weights-CodePudding

I am trying to find an algorithm that selects 12 elements from a list of 25 according to the probability of each one of them. These elements cannot be repeated.

I've tried using the function without replacement, but I can't get it to keep the weights when I do 1000 iterations to check that it was done correctly. I have only succeeded if I use the function with replace = True.

Since that function returns duplicate items to me, then I clean the list removing the duplicates. By doing this, the probability that I get each number is not the one that I had defined at the beginning. I am aware that the error is here and that I should be using the function with replace = False, but I can't get it to give me the result.

import numpy as np
from collections import Counter 

nameList = ['AAAAAA', 'B', 'C', 'D','E','F','G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P','Q','R','S','T','U','V','W','X','ZZZZZZZZZZZZZZZZ']
probability_nameList = [0.10, 0.01, 0.02, 0.03, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04]

sampleNames =[]
sampleNamesControl =[]
cleaned_result = []
test_list = []

for i in range(1000):
    sampleNames  = np.random.choice(nameList, 100,replace=True, p=probability_nameList)
    for item in sampleNames:    
        if not item in cleaned_result:
            cleaned_result  = [item] 
    sampleNames =[]
    test_list  = cleaned_result[0:12]
    cleaned_result = []

print(Counter(test_list))

Response:

Counter({'AAAAAA': 832, 'F': 522, 'X': 513, 'ZZZZZZZZZZZZZZZZ': 506, 'I': 505, 'L': 504, 'N': 501, 'S': 499, 'T': 498, 'U': 496, 'R': 492, 'O': 491, 'J': 489, 'P': 488, 'E': 487, 'V': 485, 'K': 482, 'Q': 479, 'H': 473, 'G': 471, 'M': 468, 'W': 461, 'D': 404, 'C': 297, 'B': 157})

As you can see, AAAAA had to be 10 times bigger than B, and that is not happening. P(AAAAA)= 10xP(B)

Thanks in advance for your help!

CodePudding user response：

This is not a programming issue but fundamental maths: what you want is not possible.

You need to take into account conditional probabilities.

I'll give you a small example. Let's take 4 letters A,B,C,D and pick 2 of them, without replacement, with A having a 97% probability of being picked and the others 1%.

Given A's high probability, it will be picked most of the time. But once we have picked it, as we are without replacement, there is no more A and a 1/3 chance to pick one of the other 3 values.

Indeed:

from itertools import chain
from collections import Counter

Counter(chain.from_iterable(np.random.choice(list('ABCD'),
                                             2, replace=False,
                                             p=[.97, .01, .01, .01])
                            for i in range(10000)))

Output:

Counter({'A': 9995, 'C': 3358, 'B': 3317, 'D': 3330})

A is picked almost all the time and the others are picked about one third of the time.

NB. The end frequency of B,C,D is roughly 1/3 because the probability of A is close to 1, if p(A) was smaller, the calculation of the final frequencies would be a bit more complex. The final frequency of B (or C or D) is 0.97*1/3 0.01*(0.01/0.99)*2 0.01 so a bit above 1/3, and that of A is 0.97 0.01*(0.97/0.99)*3 so ~0.9994

CodePudding user response：

If you don't mind popping off one at at a time, you can use random.choices instead of Numpy's random functions:

from random import choices

nameList = ['AAAAAA', 'B', 'C', 'D','E','F','G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P','Q','R','S','T','U','V','W','X','ZZZZZZZZZZZZZZZZ']

probability_nameList = [0.10, 0.01, 0.02, 0.03, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04]

tuples = list(zip(nameList, probability_nameList))

results = []
for _ in range(12):
    element = choices(*list(zip(*tuples)))
    tuples = [tup for tup in tuples if tup[0] != element]
    results.append(element)

print(results)

# [['AAAAAA'], ['G'], ['X'], ['U'], ['R'], ['U'], ['K'], ['D'], ['Q'], ['AAAAAA'], ['S'], ['AAAAAA']]

The above works by making element/probability pairs like ('AAAAAA', 0.01), ('B', 0.01), ... We call random.choices to get one element, using the second part of the pairs as the probability. Then we remove the corresponding tuple from the list and do it again. Results are appended to a list for output.