- Feature Articles
- CodeSOD
- Error'd
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
We won't even mention that the correct way to choose three out of four is to pick a random one of the four and throw it away, leaving the other three as the three out of four.
Oops.
Admin
In general, a random selection of a subset can be done in a few ways, but the ways that work well depend on how large the selection is relative to the size of pool of things you're picking from. Picking 3 of 500 can use the techniques in the article; picking 3 of 4 is better off doing something like shuffling and slicing (or throwing a single value away in that specific case).
Admin
I always loved the shuffle that simply iterates through replacing the current item with another random item.
It looks like it should work, but does not do a fair shuffle.
Admin
Picking N items out of M items could be done with a single random number. No need to generate the random number several times for each of the N items.
Let's say you can generate floating-point numbers between 0 and 1, and that you must pick 5 items out of 19. There are 11628 possible combinations. So you can split the interval of random number generation in 11628 parts, and say that if the random number is in a given part, then it'll result in a particular combination. Of course, do not hardcode the map between all the 11628 intervals and the 11628 combinations. Make a function to generate that.
Admin
Depends if you want combinations (order doesn't matter) or permutations (order matters). The permutation case is significantly more bothersome, and I'd be inclined to implement a proper "select without replacement" algorithm, which is more accurately called "select with deletion" in this case.
Admin
While the first example is poor code (Steve above points out the easy, obvious way to pick 3 out of 4), and it especially bugs me that 4 is hardcoded rather than using the length of the array, I don't actually think it's bad. Sure, there's no absolute upper limit on how long this will take, but it certainly won't take "forever" (theoretically any outcome with a non-zero probability must be generated randomly if you give it enough tries) and actually I reckon the odds of this even taking more than a few seconds are astronomically against.
Now I hope I wouldn't let this pass a code review, because there are obvious ways to do it that aren't this dumb and I hate the thought that anything might take unbounded time, however unlikely. But honestly I think this is everyday poor code rather than TDWTF territory.
Admin
Admin
I think Robin's point is that it's not really a WTF, not that it's not daily.
Admin
It does do a fair shuffle provided each item is only the recipient of a swap once. https://rosettacode.org/wiki/Knuth_shuffle
Admin
The in place Fisher Yates algorithm is very similar but does result in an unbiased sample as long as your random source is unbiased.
In one version, you start with an array of all the items you want to shuffle. Then you pick a random element and swap it with the first one (note you might end up swapping the first element with itself, but not to allow that would be an enigmatic error). Then you select a random element excluding the first one and swap it with the second element. Then you pick a random element excluding the first and second element and swap it with the third element, and so on until you have enough items or you get to the end of the array.
Addendum 2022-09-28 09:40: Ninja'd
Admin
obligatory xkcdDOTcom/221/ and xkcdDOTcom/1277/
Admin
What's the clever solution if you are selecting from a very wide range of numbers that wouldn't fit (conveniently) into memory for you to shuffle? I can only think of a guess-and-check approach, storing selected numbers into sets, but that's just an equivalent approach to the WTF approach shown here.
Admin
That algorithm - as pseudocoded in your comment - still seems biased, since the number that starts in the last position will never be in the last position at the end of the shuffle. The same is true for the item in the first position. I intuit that probabilistically items in the later bins at start will be more likely found in the earlier bins of the shuffled result, since the available bins to put them in shrinks consistently. Also, it doesn't ensure each item will be shuffled only once, but it does ensure it will only be shuffled twice. It only ensures that once something has been substituted in that it won't be moved again and that something can only be substituted out once and in once (max two shuffles).
Admin
Sorry, I should reword that intuition. Later items at start of shuffle should tend to be in earlier bins at end of shuffle, because the number of bins to "pull" them to the right is larger than the number of bins to "push" them to the left -- the number of bins being inversely proportional to the probability that they will be pulled/pushed from a region.
Admin
I was wondering about the biases in the different variations as well until I followed the links from that rosettacode page Ross_Presser linked to the codinghorror post and the wikipedia page. They do a great job explaining the bias inherent in the naive version and the lack thereof in the version Ross_Presser posted.
Admin
Also, for selecting only 3 random out of 500 you can stop after the third iteration.
Admin
This is not true since the random number is chosen such that
0 <= j <= i
so an element can be swapped with itself. If there are 10 elements, indexed 0 to 9, on the first iteration you select a random number in the range 0 to 9 inclusive.Again not true: the algorithm doesn't have to pick the item in position 0 ever.
This algorithm is conceptually the same as having two arrays, removing a random item from the first array and inserting it at the beginning of the second array.