Home > Net >  Random sample from an IEnumerable generated by yielding elements?
Random sample from an IEnumerable generated by yielding elements?

Time:01-13

I have a method that yields an IEnumerable, but it uses the yield keyword to return elements when executed. I don't always know how big the total collection is. It's kind of similar to the standard Fibonacci example when you go to Try .NET except that it will yield a finite number of elements. That being said, because there's no way of knowing how many elements it will return beforehand, it could keep yielding pretty much forever if there are too many.

When I looked for other questions about this topic here, one of the answers provided a clean LINQ query to randomly sample N elements from a collection. However, the assumption here was that the collection was static. If you go to the Try .NET website and modify the code to use the random sampling implementation from that answer, you will get an infinite loop.

public static void Main()
{
    foreach (var i in Fibonacci().OrderBy(f => Guid.NewGuid()).Take(20))
    {
        Console.WriteLine(i);
    }
}

The query tries to order all the elements in the returned IEnumerable, but to order all the elements it must first calculate all the elements, of which there are an infinite number, which means it will keep going on and on and never return an ordered collection.

So what would be a good strategyfor randomly sampling an IEnumerable with an unknown number of contained elements? Is it even possible?

CodePudding user response:

If a sequence is infinite in length, you can't select N elements from it in less than infinite time where the chance of each element in the sequence being selected is the same.

However it IS possible to select N items from a sequence of unknown, but finite, length. You can do that using reservoir sampling.

Here's an example implementation:

/// <summary>
/// This uses Reservoir Sampling to select <paramref name="n"/> items from a sequence of items of unknown length.
/// The sequence must contain at least <paramref name="n"/> items.
/// </summary>
/// <typeparam name="T">The type of items in the sequence from which to randomly choose.</typeparam>
/// <param name="items">The sequence of items from which to randomly choose.</param>
/// <param name="n">The number of items to randomly choose from <paramref name="items"/>.</param>
/// <param name="rng">A random number generator.</param>
/// <returns>The randomly chosen items.</returns>

public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.

    var result = new T[n];
    int index = 0;
    int count = 0;

    foreach (var item in items)
    {
        if (index < n)
        {
            result[count  ] = item;
        }
        else
        {
            int r = rng.Next(0, index   1);

            if (r < n)
                result[r] = item;
        }

          index;
    }

    if (index < n)
        throw new ArgumentException("Input sequence too short");

    return result;
}

This must still iterate over the entire sequence, however, so it does NOT work for an infinitely long sequence.

CodePudding user response:

You need to specify how many items you like to randomly chose from. You can do this by moving the Take before the OrderBy.

Fibonacci().Take(1000).OrderBy(f => Guid.NewGuid()).Take(20);

Randomize the first 1k items and then pick 20 from them for example.

  •  Tags:  
  • Related