The Daily WTF: Curious Perversions in Information Technology

Earl Purple · 2007-01-04 Reply Admin

Even if you didn't know about these extra options, the answer would still be to pick a number between 0 and the number of elements, then choose the nth element according to what number was randomly picked.

aakoch · 2007-01-04 Reply Admin

I think another WTF is the naming of the variables!

$int = array();

mraymondus · 2007-01-04 Reply Admin

I read your comment at the bottom and that got me to thinking... I've worked for a big software (rdbms) company for almost 8 years and another one for 7 before that. Most of my tasks involve business/finance logic so I cannot ever recall needing to do random access like this. I popped up a db session and started trying to figure out how to get a random row from a table. When the syntax 'RANDOM()' didn't work, I did what all self-respecting developers will do.. I googled it.

Here's an excellent summary of how to do this on several of the more popular rdbms products:

http://www.petefreitag.com/item/466.cfm

If I should ever have to randomly draw data from a table, at least now I'll know a shortcut. Thanks for the healthy diversion from my otherwise bland day.

2007-01-04 Reply Admin

SELECT * FROM bla ORDER BY RAND() LIMIT 1;

?

Derrick Pallas · 2007-01-04 Reply Admin

Thanks for the link, I know that getting random rows isn't trivial. Rather that enumerate the number of different things that were possible, I went to that page and chose one (PostgreSQL) and ran with it. Normally I use MySQL (or SQLite) but I liked the keyword being spelt out as RANDOM().

2007-01-04 Reply Admin

A little more complex in oracle

select * from

(select id from mytable order by dbms_random.value)

where rownum = 1

2007-01-04 Reply Admin

EXACTLY what i was gonna say.

qbolec · 2007-01-04 Reply Admin

It seems to me that MySQL 4.0 is stupid enough to actually perform full sort before fetching the one item:

mysql> select count(1) from friends;
+----------+
| count(1) |
+----------+
| 15996058 |
+----------+
1 row in set (0.06 sec)

So that's why I'll stick to:

list($minrand,$maxrand)=sqlrow("SELECT Min(ID),Max(ID) FROM friends");
$choselimit=3;
for($i=0;$i<$choselimit;$i++){
  $retry=100;
  while($retry--){
   $r=$minrand+((rand()*1000000+rand()*100+rand())% ($maxrand-$minrand));
   $ask="
SELECT *
FROM friends
WHERE ID>='$r'
LIMIT 1
   ";
   $q=mysql_query($ask);
   if(mysql_num_rows($q)==1)
    break;
  }
}

It looks lame, but I don't see other option to deal with incoherent set of ID's

qbolec · 2007-01-04 Reply Admin

After half an hour I decided to :

qbolec@vanikomp:~$ mysqladmin -uXXX -pXXXX kill 118

qbolec · 2007-01-04 Reply Admin

..and I have no idea why didn't I use rand($minrand,$maxrand)..

Whiskey Tango Foxtrot? Over. · 2007-01-04 Reply Admin

I didn't have the heart to tell her about SELECT * FROM table ORDER BY RANDOM() LIMIT 1 or the variants thereof.

So, rather than tell her privately how to improve her code, you post it publicly, potentially humiliating her? Yeah, you're right... you don't have a heart! :D

2007-01-04 Reply Admin

qbolec:
So that's why I'll stick to:
list($minrand,$maxrand)=sqlrow("SELECT Min(ID),Max(ID) FROM friends");
$choselimit=3;
for($i=0;$i<$choselimit;$i++){
  $retry=100;
  while($retry--){
   $r=$minrand+((rand()*1000000+rand()*100+rand())% ($maxrand-$minrand));
   $ask="
SELECT *
FROM friends
WHERE ID>='$r'
LIMIT 1
   ";
   $q=mysql_query($ask);
   if(mysql_num_rows($q)==1)
    break;
  }
}
It looks lame, but I don't see other option to deal with incoherent set of ID's

Wow! That's a terrible way to do it! :) Any time you have to retry an operation an arbitrary number of times to get it to work, you're probably doing something wrong.

By the way, you should make sure your tables are indexed, because it sounds like maybe they're not. Second, of course MySQL has to comapre every row -- you're telling it to ORDER BY random(). How could it know which row comes first unless it gets a random number for each row?

Now, a better solution would be:

SELECT count(*) FROM foo

and save the result as num_rows. Then

SELECT * FROM foo LIMIT [random number between 0 and num_rows],1

you just selected a random row in two quick queries! congrats!

savar · 2007-01-04 Reply Admin

mraymondus:

http://www.petefreitag.com/item/466.cfm
If I should ever have to randomly draw data from a table, at least now I'll know a shortcut. Thanks for the healthy diversion from my otherwise bland day.

Those are all going to blow up in your face if you have a database with more than a few tens of thousands of rows. In each case you generating a RANDOM (slow), then sorting the entire table by that generated value (slower), and then throwing away the entire tail of the list just to return the head (stupid!).

Somebody in the comments of the page you linked to noticed the same and provided this link:

http://www.greggdev.com/web/articles.php?id=6

It's a little better. Now you're only creating one random but you're still doing multiple full table scans.

The best way? Keep your own row ID (a unique index, separate from the primary key) that is guaranteed (by using stored procedures to manipulate the table data) to always be dense and range from 1 to ROWCOUNT. Generate a rand(1,ROWCOUNT) and return that row. Now you're not doing any full table scans, and you only used two DML statements.

Storage is cheap, CPU is not.

Frankly, compared to the methods described here, I don't think the submitter's code is that ridiculous.

2007-01-04 Reply Admin

savar:
mraymondus:

http://www.petefreitag.com/item/466.cfm
If I should ever have to randomly draw data from a table, at least now I'll know a shortcut. Thanks for the healthy diversion from my otherwise bland day.
Those are all going to blow up in your face if you have a database with more than a few tens of thousands of rows. In each case you generating a RANDOM (slow), then sorting the entire table by that generated value (slower), and then throwing away the entire tail of the list just to return the head (stupid!).
Somebody in the comments of the page you linked to noticed the same and provided this link:
http://www.greggdev.com/web/articles.php?id=6
It's a little better. Now you're only creating one random but you're still doing multiple full table scans.
The best way? Keep your own row ID (a unique index, separate from the primary key) that is guaranteed (by using stored procedures to manipulate the table data) to always be dense and range from 1 to ROWCOUNT. Generate a rand(1,ROWCOUNT) and return that row. Now you're not doing any full table scans, and you only used two DML statements.
Storage is cheap, CPU is not.
Frankly, compared to the methods described here, I don't think the submitter's code is that ridiculous.

okay smarty pants, how do you do in TSQL using MS SQL 2000. Yes I know that MS SQL 2005 has return set row numbering.

Pingmaster · 2007-01-04 Reply Admin

RogerC:

Anonymous:
Anonymous:
Firs
That really was retarted. Congratulations.
The pot calls the kettle black. It's retarded, not retarted.
Call me a grammar cop if you must, but I just couldn't resist.

But then, one is forced to ask, why are the grammar cops going after spelling errors, isn't that out of their jurisdiction?

2007-01-04 Reply Admin

Select top 1 * from table
order by NewID()

Don't do it tho if the table has more the 10k recs or it will take too long. Has to order the dataset before taking the top 1

2007-01-04 Reply Admin

Earl Purple:
Even if you didn't know about these extra options, the answer would still be to pick a number between 0 and the number of elements, then choose the nth element according to what number was randomly picked.

We have a WTF with this thinking. Or, maybe just my interpretation of this thinking.

If you have all the table results in an array, this would be good, but to randomly pick a number between 0 and the number of elements would be a bad idea with SQL. You're assuming that each record is numbered sequentially, this isn't the bad thing, but you're also assuming that each record is there. What happens when you delete a record? Oops, that won't work.

Besides using the SQL Random methods, the above code isn't all that bad. It is portable. That SQL would work with any database that I know of. Granted if I wanted to go that route I would only select the id column, then pick a random array element, then fetch the record. Maybe I'm different, but I prefer portable code.

If I was a professor, I'd still give her some credit.

qbolec · 2007-01-04 Reply Admin

Anonymous:

Wow! That's a terrible way to do it! :) Any time you have to retry an operation an arbitrary number of times to get it to work, you're probably doing something wrong.

I agree:)

Anonymous:

Second, of course MySQL has to comapre every row -- you're telling it to ORDER BY random(). How could it know which row comes first unless it gets a random number for each row?

Right. But still I was expecting the optimizer to spot that.

Moreover it looks like putting rand() inside your query discourages MySQL from performing any optimizations:

mysql> explain select min(u)+0.241*(max(u)-min(u)) from friends;
+------------------------------+
| Comment |
+------------------------------+
| Select tables optimized away |
+------------------------------+
1 row in set (0.00 sec)

Anonymous:

SELECT * FROM foo LIMIT [random number between 0 and num_rows],1

mysql> select * from friends limit 1000000,1;
+-------+--------+
| u | v |
+-------+--------+
| 47121 | 362635 |
+-------+--------+
1 row in set (10.61 sec)

It seems to me it's linear access time. [it worked much faster for lower number, though there is an index on (u,v) and I haven't specified any particular order]

2007-01-04 Reply Admin

Pingmaster:

But then, one is forced to ask, why are the grammar cops going after spelling errors, isn't that out of their jurisdiction?

Sir or madam, I'm afraid that's a comma splice. I will have to take you into custody and drub you soundly with a truncheon.

2007-01-04 Reply Admin

Stupid MySQL. What's a good way to select n (figure about 20) random records in a random order with no duplicates? This one still confuses me.

2007-01-04 Reply Admin

Can't you "select into" a temporary table, to get those 20 random records?

2007-01-04 Reply Admin

qbolec:

Right. But still I was expecting the optimizer to spot that.
Moreover it looks like putting rand() inside your query discourages MySQL from performing any optimizations:

But, you forget that rand() is a row-level function. Therefore, it *must* be evaluate for every row. Thus, since the value is different for each row, the calculation cannot be (completely) optimized.

2007-01-04 Reply Admin

Some of the previous posts showed horrible solutions that were no better than the original code. Here's how I would do it:

list($min, $max) = mysql_fetch_row(mysql_query('SELECT MIN(id), MAX(id) FROM abc'));
$a = rand($min, $max)
$row = mysql_fetch_row(mysql_query('SELECT * FROM abc WHERE id >= '.$a.' LIMIT 1'));

2007-01-04 Reply Admin

I don't do Mysql as I prefer Postgresql (http://www.postgresql.org) a much more mature, feature-rich and efficient open source database (BSD style).

But generally, I believe the best way is to :

1) select count(*) from table

2) choose a random number 0 - count(*)

3) select * from table order by <primary_key> limit 1 offset <random number>

On posgresql select count(*) is expensive (but accurate), so one might want to keep that value cached in a table (if you have more than a few tens of thousands of rows)

Benefits of this approach:

1) as the data is sorted by the primary key, the data should already be indexed (that is cheap access via index scans)

2) you will always get a row back, as there is no way that row could not be there (if you select where id = <random number>, you risk having a hole there...)

3) if you need more than one, just UNION multiple lines upon each other, the database should still be able to optimize the query into one operation

But really, there is no way of saying up front what is fastest.. you need to analyze the query plans (EXPLAIN is your friend). Any significant query must be profiled before going into production...

mbvlist · 2007-01-04 Reply Admin

qbolec:
Anonymous:

SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
mysql> select * from friends limit 1000000,1;
+-------+--------+
| u | v |
+-------+--------+
| 47121 | 362635 |
+-------+--------+
1 row in set (10.61 sec)
It seems to me it's linear access time. [it worked much faster for lower number, though there is an index on (u,v) and I haven't specified any particular order]

You only have an index on u,v? I'm not sure, please do the explain-thingy, but shouldn't you have an index on u separately for this query?

Either way MySQL can't use the index: the index is sorted, so it can only get number 10.000 pretty quick if it's sorted :)

2007-01-04 Reply Admin

Assuming the goal is a uniform distribution over all rows, this fails horribly if the id's are not evenly distributed.

savar · 2007-01-04 Reply Admin

Anonymous:
Pingmaster:

But then, one is forced to ask, why are the grammar cops going after spelling errors, isn't that out of their jurisdiction?
Sir or madam, I'm afraid that's a comma splice. I will have to take you into custody and drub you soundly with a truncheon.

It was a setup!

2007-01-04 Reply Admin

savar:
Frankly, compared to the methods described here, I don't think the submitter's code is that ridiculous.

Maybe you missed the part where she:

selected the entire table
walked the entire result set to build an array of the IDs and
iterate a counter
walked that array until she hit the id that matched the rand (assuming IDs are not sparse)
and then queried again now that she had (maybe) an ID

qbolec · 2007-01-04 Reply Admin

mbvlist:
qbolec:
Anonymous:

SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
mysql> select * from friends limit 1000000,1;
+-------+--------+
| u | v |
+-------+--------+
| 47121 | 362635 |
+-------+--------+
1 row in set (10.61 sec)
It seems to me it's linear access time. [it worked much faster for lower number, though there is an index on (u,v) and I haven't specified any particular order]
You only have an index on u,v? I'm not sure, please do the explain-thingy, but shouldn't you have an index on u separately for this query?
Either way MySQL can't use the index: the index is sorted, so it can only get number 10.000 pretty quick if it's sorted :)

Well I pretty sure that MySQL arranges index on(u,v) in such a way that first 4 bytes of each entry corresponds to index on(u) - since many plans shown by EXPLAIN contained the (u,v) index, where (u) would be applicable, even when I had both. Due to hard disk shortage I decided to drop the (u) index.

However, it seems totally irrelevant to me, because there is no "ORDER BY" clause in my query, so the index could be used here only to omitt reading rows directly from table - but there should be no difference between reading them from table and from index, since the row contains exactly the same data as the index [but in different data structure - I'm not sure what's the performace impact of cache losses and so on]

What seems important is that MySQL's indexes (B-trees) do not contain positional stats (like size of the subtree etc.) so they do not really help in finding n-th element when the ID's are not continuous.

2007-01-04 Reply Admin

Anonymous:
Assuming the goal is a uniform distribution over all rows, this fails horribly if the id's are not evenly distributed.

Absolutely. It's also a problem if the ID's are not integers. The way I would do it (at the risk of adding another WTF to the pool) is to

Select all the ID's into an array (sorting/indexing is irrelevant).
Pick an ID from the array at random (unlike the ID's, you can rely any integer between 0 and array length-1 to be a valid array index)
Use that ID to select whatever other columns you need.

It scales linearly (fixed # of bytes per index * N rows, no sorting), and it's not RDBMS-specific.

--YY

2007-01-04 Reply Admin

From a database engine point of view, wouldn't the most efficient way just be to randomly traverse the B-tree? I mean, shouldn't you hit a record in less than 2^64 steps each time, and still maintain an even distribution? (which is not the case in some of the suggestions here).

The question is if it's possible to do it in SQL though...

2007-01-04 Reply Admin

whoops, just realized what I wrote.. Of course I didn't mean 2^64 steps, that would be ridiculous. Just 64, as that should be the number required to reach 2^64 records. ;)

2007-01-04 Reply Admin

So far, this is the sanest post out here , as in provides the best solution so far.

Random sampling in SQL seems to be a domain of knowledge itself, this is what I found so far and think it is worth it:

An academic bit (a homework actually)
The quick and practical approach
For those not disgruntled with MS SQL Server
DB2 implements sql'03 ideas of bernoulli and system sampling
Bigger (albeit old) research on topic. The download bit is in top right corner. Citation list shows updated works on the subject

I myself would probably use the practical approach, as it is simple widely available even on feature poor systems, such as MySQL 4.0 and likes.

P.S. A note the the editor: order by random() is not a wise idea :)

2007-01-04 Reply Admin

Pingmaster:
RogerC:

Anonymous:
Anonymous:
Firs
That really was retarted. Congratulations.
The pot calls the kettle black. It's retarded, not retarted.
Call me a grammar cop if you must, but I just couldn't resist.

But then, one is forced to ask, why are the grammar cops going after spelling errors, isn't that out of their jurisdiction?

Yes, but it is in their jurisdictionary!

Thank you, I'll be here all week.

Cap'n Steve · 2007-01-04 Reply Admin

Getting the COUNT(), generating the random number within the program, and then using LIMIT to skip a random number of rows is the only good way to get a random row.

As far as I know, the only way to get more than one row in random order in a single query is to ORDER BY RAND(). I've heard that something like this: "SELECT RAND() AS random, x. FROM x ORDER BY random" is better performance-wise. Anyone know for sure?

Arancaytar · 2007-01-05 Reply Admin

Found her. The person whose PHP code I've been fixing for the past year. >:(

Okay, not really, because I know from his name it was a guy. But seriously, things like that were like his artist signature. Select all the rows from a table, then use a loop to get the one you want.

In some cases, use a loop to retrieve additional data about each row (one connect and query per loop iteration) from other tables. Joins are not for everyone...

2007-01-05 Reply Admin

You could always just generate multiple random numbers and use IN (.. list of numbers ..) or similar. Not elegant, but beats a full table scan for large tables.

Oh, and including RANDOM() in the selected columns should not make it any faster, unless the DBMS is badly designed/has a crap optimiser.

2007-01-05 Reply Admin

Your STILL a rookie baby.. :D

2007-01-05 Reply Admin

On postgresql

SELECT * FROM table ORDER BY RANDOM() LIMIT 1

takes about 50 seconds to execute on a table with 1 mln records

I think I'd go with:

SELECT count(*) FROM table

followed by:

SELECT * FROM table OFFSET $random LIMIT 1

2007-01-05 Reply Admin

Anonymous:
Some of the previous posts showed horrible solutions that were no better than the original code. Here's how I would do it:
list($min, $max) = mysql_fetch_row(mysql_query('SELECT MIN(id), MAX(id) FROM abc'));
$a = rand($min, $max)
$row = mysql_fetch_row(mysql_query('SELECT * FROM abc WHERE id >= '.$a.' LIMIT 1'));

Better methods than this is already mentioned in this thread, so why come up with this bs now? I wont even tell you whats wrong about this, if you dont understand it go do something else is my advice :D

Derrick Pallas · 2007-01-05 Reply Admin

P.S A note to the editor: order by random() is not a wise idea :)

Exactly. Notice that (in her reply) she "fixed" the problem by using array_rand, which suffers from the same problem: reading the entire table to get one row. Making the same mistake in the SQL engine is the next logical step. (Of course, it'd be nice if the optimizer took care of the problem.)

2007-01-05 Reply Admin

Except for the fact that the distribution of IDs might not be uniform and hence prevent a uniform distribution of random results, what's wrong with it? It's clearly the most scalable solution from a practical perspective.

2007-01-05 Reply Admin

Your answer is better then the one in the article. If you do a "order by random()" you will select a random number for each row in the database and sort by this value (unless the db recognizes the idiom and have some sort of quicker way of doing it), just to select the first row of this sorted. If the database is small enough this may not be bad, but if it has millions of rows this could be bad.

Another solution would be to use a where clause instead of a random sorting.

2007-01-05 Reply Admin

> It looks lame, but I don't see other option to deal with incoherent set of ID's

You shouldn't treat IDs as numbers, as they're not intended to do math with. Keys should be informationless.

The Real WTF of your code is that if your IDs are not evenly distributed (for example because some records have been deleted) some records will pop up more frequently than others.

Instead of WHERE ID>='$r', use the row count to create a query that does a LIMIT xxxx,1 (where xxxx is a random number from 1..rowcount):
In pseudocode:

function randomquery($rowcount,$query)
{
$r=$rowcount*rand()+1;
return "$query limit $r,1";
}

// $rowcount=number of rows in table (select count(1) from friends)
$ask=randomquery("select * from friends",$rowcount);

This will return you a random row with equal probability for each record, regardless of their ID.
Also, it can be used on any table (or even joined tables) because it does not require any field names.

2007-01-05 Reply Admin

Derrick Pallas:
P.S A note to the editor: order by random() is not a wise idea :)
Exactly. Notice that (in her reply) she "fixed" the problem by using array_rand, which suffers from the same problem: reading the entire table to get one row. Making the same mistake in the SQL engine is the next logical step. (Of course, it'd be nice if the optimizer took care of the problem.)

For those methods that use RAND to sort the rows, it is not possible for the optimiser to take care of it: it can't know which are the n lowest random values until it has generated them all and sorted them all. I suppose it could keep a list of bookmarks to the n lowest values generated so far, but that would be an optimisation for this specific query.

GeneWitch · 2007-01-05 Reply Admin

I'm going to bookmark this whole thread, as i am going to want to pull random things out of a DB, and i don't want it to be WTFy. Not that i know anything about Relational databases or anything, but i assume once i have the basics down that doing this sort of stuff (with the help here) will be almost trivial.

I just hope i can get the basics down. :-)

2007-01-05 Reply Admin

Anonymous:
Derrick Pallas:

P.S A note to the editor: order by random() is not a wise idea :)
Exactly. Notice that (in her reply) she "fixed" the problem by using array_rand, which suffers from the same problem: reading the entire table to get one row. Making the same mistake in the SQL engine is the next logical step. (Of course, it'd be nice if the optimizer took care of the problem.)
For those methods that use RAND to sort the rows, it is not possible for the optimiser to take care of it: it can't know which are the n lowest random values until it has generated them all and sorted them all. I suppose it could keep a list of bookmarks to the n lowest values generated so far, but that would be an optimisation for this specific query.

Look peeps the real problem here is that relational databases are not optimized to retrieve random data. The primary purposes of a relational database are to record, organize and report on data, to make it LESS random in a sense. The most random thing about relational databases is their widely varying implementations of SQL syntax.

Today's WTF is an example of a poor way to do this on just about any platform, and the suggested solution of ORDER BY RANDOM() may be worse in some situations. The "best" solution to this will most likely be DB platform and domain specific.

Been working with database driven business applications for nearly 20 years and have yet to encounter a situation where random data was desired to be returned. I have needed the occasional random number, but random data? Can anyone give an example where that is useful? Only thing I can think of is a "Quote of the Day" scenario where you are unlikely to have millions of rows . . . So while this is an interesting mental exercise I wonder about the WTF'ery of a business application where random data is desired.

qbolec · 2007-01-05 Reply Admin

Anonymous:

> It looks lame, but I don't see other option to deal with incoherent set of ID's
You shouldn't treat IDs as numbers, as they're not intended to do math with. Keys should be informationless.
The Real WTF of your code is that if your IDs are not evenly distributed (for example because some records have been deleted) some records will pop up more frequently than others.

Instead of WHERE ID>='$r', use the row count to create a query that does a LIMIT xxxx,1 (where xxxx is a random number from 1..rowcount):
In pseudocode:

function randomquery($rowcount,$query)
{
$r=$rowcount*rand()+1;
return "$query limit $r,1";
}

// $rowcount=number of rows in table (select count(1) from friends)
$ask=randomquery("select * from friends",$rowcount);
This will return you a random row with equal probability for each record, regardless of their ID.
Also, it can be used on any table (or even joined tables) because it does not require any field names.

Well I was totaly aware of the problem with anomalies in key distribution. However I was looking for fast solution, as the random thumbnails were supposed to be presentat at each webpage. Therefore I decided to sacrafice the quality of the solution :)

Your solution was already posted, and I have already answered to this approach - it's slow in MySQL 4.0. I don't know why (problably MySQL lacks some statistics in it's B-tree) but the time required to fetch N-th row using LIMIT N,1 is proportional to N (or at least strictly monotone:) )

2007-01-05 Reply Admin

To all you CHOPS who suggest that you SELECT COUNT(*) FROM BLA and then choose a random number between 0 and the result and the select are living/working in a dream world.

What about application servers that generate a seed for the unique id on every restart ?

What about seeding the PK to not start at 0 ?

2007-01-05 Reply Admin

Slightly off topic - Is this also a requirements / analysis WTF.

Too bad we don't have any context for why it needed to be random in the first place.Does it truly need to be random, or just perceived to be random? Is there also requirements that all records should be viewed over time. Should we replace the word 'random' with 'not the same every time'

I'm thinking of the case of banner ads or similar where the client just wants the content to change on subsequent requests. You really wouldn't want to be completely random because you can't guarantee that everything gets viewed and the quality of the random algorithm could bias your results.

You might also have contractual obligations to show content, and if random, you can't assure the business it will be shown (unless of course you can convince the boss that your website will still be running when the universe dies)

Finding Random Rows

Leave a comment on “Finding Random Rows”