- Feature Articles
- CodeSOD
- Error'd
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
Even if you didn't know about these extra options, the answer would still be to pick a number between 0 and the number of elements, then choose the nth element according to what number was randomly picked.
Admin
I think another WTF is the naming of the variables!
Admin
I read your comment at the bottom and that got me to thinking... I've worked for a big software (rdbms) company for almost 8 years and another one for 7 before that. Most of my tasks involve business/finance logic so I cannot ever recall needing to do random access like this. I popped up a db session and started trying to figure out how to get a random row from a table. When the syntax 'RANDOM()' didn't work, I did what all self-respecting developers will do.. I googled it.
Here's an excellent summary of how to do this on several of the more popular rdbms products:
http://www.petefreitag.com/item/466.cfm
If I should ever have to randomly draw data from a table, at least now I'll know a shortcut. Thanks for the healthy diversion from my otherwise bland day.
Admin
SELECT * FROM bla ORDER BY RAND() LIMIT 1;
?
Admin
Thanks for the link, I know that getting random rows isn't trivial. Rather that enumerate the number of different things that were possible, I went to that page and chose one (PostgreSQL) and ran with it. Normally I use MySQL (or SQLite) but I liked the keyword being spelt out as RANDOM().
Admin
A little more complex in oracle
select * from
(select id from mytable order by dbms_random.value)
where rownum = 1
Admin
EXACTLY what i was gonna say.
Admin
It seems to me that MySQL 4.0 is stupid enough to actually perform full sort before fetching the one item:
mysql> select count(1) from friends;
+----------+
| count(1) |
+----------+
| 15996058 |
+----------+
1 row in set (0.06 sec)
mysql> select * from friends order by RAND() limit 1;
Aborted
qbolec@vanikomp:~$ mysqladmin -uXXX -pXXXX processlist
+-----+------+-----------+------+---------+------+------------------------------
+-----------------------------------------------+
| Id | User | Host | db | Command | Time | State
| Info |
+-----+------+-----------+------+---------+------+------------------------------
+-----------------------------------------------+
| 118 | XXX | localhost | test | Query | 53 | Copying to tmp table on disk
| select * from friends order by RAND() limit 1 |
| 119 | XXX | localhost | | Query | 0 |
| show processlist |
+-----+------+-----------+------+---------+------+------------------------------
+-----------------------------------------------+
So that's why I'll stick to:
list($minrand,$maxrand)=sqlrow("SELECT Min(ID),Max(ID) FROM friends");
$choselimit=3;
for($i=0;$i<$choselimit;$i++){
$retry=100;
while($retry--){
$r=$minrand+((rand()*1000000+rand()*100+rand())% ($maxrand-$minrand));
$ask="
SELECT *
FROM friends
WHERE ID>='$r'
LIMIT 1
";
$q=mysql_query($ask);
if(mysql_num_rows($q)==1)
break;
}
}
It looks lame, but I don't see other option to deal with incoherent set of ID's
Admin
After half an hour I decided to :
qbolec@vanikomp:~$ mysqladmin -uXXX -pXXXX kill 118
Admin
..and I have no idea why didn't I use rand($minrand,$maxrand)..
Admin
So, rather than tell her privately how to improve her code, you post it publicly, potentially humiliating her? Yeah, you're right... you don't have a heart! :D
Admin
Wow! That's a terrible way to do it! :) Any time you have to retry an operation an arbitrary number of times to get it to work, you're probably doing something wrong.
By the way, you should make sure your tables are indexed, because it sounds like maybe they're not. Second, of course MySQL has to comapre every row -- you're telling it to ORDER BY random(). How could it know which row comes first unless it gets a random number for each row?
Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
you just selected a random row in two quick queries! congrats!
Admin
Those are all going to blow up in your face if you have a database with more than a few tens of thousands of rows. In each case you generating a RANDOM (slow), then sorting the entire table by that generated value (slower), and then throwing away the entire tail of the list just to return the head (stupid!).
Somebody in the comments of the page you linked to noticed the same and provided this link:
http://www.greggdev.com/web/articles.php?id=6
It's a little better. Now you're only creating one random but you're still doing multiple full table scans.
The best way? Keep your own row ID (a unique index, separate from the primary key) that is guaranteed (by using stored procedures to manipulate the table data) to always be dense and range from 1 to ROWCOUNT. Generate a rand(1,ROWCOUNT) and return that row. Now you're not doing any full table scans, and you only used two DML statements.
Storage is cheap, CPU is not.
Frankly, compared to the methods described here, I don't think the submitter's code is that ridiculous.
Admin
okay smarty pants, how do you do in TSQL using MS SQL 2000. Yes I know that MS SQL 2005 has return set row numbering.
Admin
But then, one is forced to ask, why are the grammar cops going after spelling errors, isn't that out of their jurisdiction?
Admin
Select top 1 * from table
order by NewID()
Don't do it tho if the table has more the 10k recs or it will take too long. Has to order the dataset before taking the top 1
Admin
We have a WTF with this thinking. Or, maybe just my interpretation of this thinking.
If you have all the table results in an array, this would be good, but to randomly pick a number between 0 and the number of elements would be a bad idea with SQL. You're assuming that each record is numbered sequentially, this isn't the bad thing, but you're also assuming that each record is there. What happens when you delete a record? Oops, that won't work.
Besides using the SQL Random methods, the above code isn't all that bad. It is portable. That SQL would work with any database that I know of. Granted if I wanted to go that route I would only select the id column, then pick a random array element, then fetch the record. Maybe I'm different, but I prefer portable code.
If I was a professor, I'd still give her some credit.
Admin
I agree:)
Right. But still I was expecting the optimizer to spot that.
Moreover it looks like putting rand() inside your query discourages MySQL from performing any optimizations:
mysql> explain select min(u)+rand()*(max(u)-min(u)) from friends;
+---------+-------+---------------+-------------+---------+------+----------+-------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+---------+-------+---------------+-------------+---------+------+----------+-------------+
| friends | index | NULL | friends3_uv | 8 | NULL | 15996058 | Using index |
+---------+-------+---------------+-------------+---------+------+----------+-------------+
1 row in set (0.04 sec)
mysql> explain select min(u)+0.241*(max(u)-min(u)) from friends;
+------------------------------+
| Comment |
+------------------------------+
| Select tables optimized away |
+------------------------------+
1 row in set (0.00 sec)
mysql> select * from friends limit 1000000,1;
+-------+--------+
| u | v |
+-------+--------+
| 47121 | 362635 |
+-------+--------+
1 row in set (10.61 sec)
It seems to me it's linear access time. [it worked much faster for lower number, though there is an index on (u,v) and I haven't specified any particular order]
Admin
Sir or madam, I'm afraid that's a comma splice. I will have to take you into custody and drub you soundly with a truncheon.
Admin
Stupid MySQL. What's a good way to select n (figure about 20) random records in a random order with no duplicates? This one still confuses me.
Admin
Can't you "select into" a temporary table, to get those 20 random records?
Admin
But, you forget that rand() is a row-level function. Therefore, it *must* be evaluate for every row. Thus, since the value is different for each row, the calculation cannot be (completely) optimized.
Admin
Some of the previous posts showed horrible solutions that were no better than the original code. Here's how I would do it:
Admin
I don't do Mysql as I prefer Postgresql (http://www.postgresql.org) a much more mature, feature-rich and efficient open source database (BSD style).
But generally, I believe the best way is to :
1) select count(*) from table
2) choose a random number 0 - count(*)
3) select * from table order by <primary_key> limit 1 offset <random number>
On posgresql select count(*) is expensive (but accurate), so one might want to keep that value cached in a table (if you have more than a few tens of thousands of rows)
Benefits of this approach:
1) as the data is sorted by the primary key, the data should already be indexed (that is cheap access via index scans)
2) you will always get a row back, as there is no way that row could not be there (if you select where id = <random number>, you risk having a hole there...)
3) if you need more than one, just UNION multiple lines upon each other, the database should still be able to optimize the query into one operation
But really, there is no way of saying up front what is fastest.. you need to analyze the query plans (EXPLAIN is your friend). Any significant query must be profiled before going into production...
Admin
You only have an index on u,v? I'm not sure, please do the explain-thingy, but shouldn't you have an index on u separately for this query?
Either way MySQL can't use the index: the index is sorted, so it can only get number 10.000 pretty quick if it's sorted :)
Admin
Assuming the goal is a uniform distribution over all rows, this fails horribly if the id's are not evenly distributed.
Admin
It was a setup!
Admin
Admin
Well I pretty sure that MySQL arranges index on(u,v) in such a way that first 4 bytes of each entry corresponds to index on(u) - since many plans shown by EXPLAIN contained the (u,v) index, where (u) would be applicable, even when I had both. Due to hard disk shortage I decided to drop the (u) index.
However, it seems totally irrelevant to me, because there is no "ORDER BY" clause in my query, so the index could be used here only to omitt reading rows directly from table - but there should be no difference between reading them from table and from index, since the row contains exactly the same data as the index [but in different data structure - I'm not sure what's the performace impact of cache losses and so on]
What seems important is that MySQL's indexes (B-trees) do not contain positional stats (like size of the subtree etc.) so they do not really help in finding n-th element when the ID's are not continuous.
mysql> explain select * from friends limit 1000000,1;
+---------+-------+---------------+-------------+---------+------+----------+-------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+---------+-------+---------------+-------------+---------+------+----------+-------------+
| friends | index | NULL | friends3_uv | 8 | NULL | 15996058 | Using index |
+---------+-------+---------------+-------------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Admin
Absolutely. It's also a problem if the ID's are not integers. The way I would do it (at the risk of adding another WTF to the pool) is to
- Select all the ID's into an array (sorting/indexing is irrelevant).
- Pick an ID from the array at random (unlike the ID's, you can rely any integer between 0 and array length-1 to be a valid array index)
- Use that ID to select whatever other columns you need.
It scales linearly (fixed # of bytes per index * N rows, no sorting), and it's not RDBMS-specific.--YY
Admin
From a database engine point of view, wouldn't the most efficient way just be to randomly traverse the B-tree? I mean, shouldn't you hit a record in less than 2^64 steps each time, and still maintain an even distribution? (which is not the case in some of the suggestions here).
The question is if it's possible to do it in SQL though...
Admin
whoops, just realized what I wrote.. Of course I didn't mean 2^64 steps, that would be ridiculous. Just 64, as that should be the number required to reach 2^64 records. ;)
Admin
So far, this is the sanest post out here , as in provides the best solution so far.
Random sampling in SQL seems to be a domain of knowledge itself, this is what I found so far and think it is worth it:
I myself would probably use the practical approach, as it is simple widely available even on feature poor systems, such as MySQL 4.0 and likes.
P.S. A note the the editor: order by random() is not a wise idea :)
Admin
Yes, but it is in their jurisdictionary!
Thank you, I'll be here all week.
Admin
Getting the COUNT(), generating the random number within the program, and then using LIMIT to skip a random number of rows is the only good way to get a random row.
As far as I know, the only way to get more than one row in random order in a single query is to ORDER BY RAND(). I've heard that something like this: "SELECT RAND() AS random, x. FROM x ORDER BY random" is better performance-wise. Anyone know for sure?
Admin
Found her. The person whose PHP code I've been fixing for the past year. >:(
Okay, not really, because I know from his name it was a guy. But seriously, things like that were like his artist signature. Select all the rows from a table, then use a loop to get the one you want.
In some cases, use a loop to retrieve additional data about each row (one connect and query per loop iteration) from other tables. Joins are not for everyone...
Admin
You could always just generate multiple random numbers and use IN (.. list of numbers ..) or similar. Not elegant, but beats a full table scan for large tables.
Oh, and including RANDOM() in the selected columns should not make it any faster, unless the DBMS is badly designed/has a crap optimiser.
Admin
Your STILL a rookie baby.. :D
Admin
On postgresql
SELECT * FROM table ORDER BY RANDOM() LIMIT 1
takes about 50 seconds to execute on a table with 1 mln records
I think I'd go with:
SELECT count(*) FROM table
followed by:
SELECT * FROM table OFFSET $random LIMIT 1
Admin
Better methods than this is already mentioned in this thread, so why come up with this bs now? I wont even tell you whats wrong about this, if you dont understand it go do something else is my advice :D
Admin
P.S A note to the editor: order by random() is not a wise idea :)
Exactly. Notice that (in her reply) she "fixed" the problem by using array_rand, which suffers from the same problem: reading the entire table to get one row. Making the same mistake in the SQL engine is the next logical step. (Of course, it'd be nice if the optimizer took care of the problem.)
Admin
Except for the fact that the distribution of IDs might not be uniform and hence prevent a uniform distribution of random results, what's wrong with it? It's clearly the most scalable solution from a practical perspective.
Admin
Your answer is better then the one in the article. If you do a "order by random()" you will select a random number for each row in the database and sort by this value (unless the db recognizes the idiom and have some sort of quicker way of doing it), just to select the first row of this sorted. If the database is small enough this may not be bad, but if it has millions of rows this could be bad.
Another solution would be to use a where clause instead of a random sorting.
Admin
> It looks lame, but I don't see other option to deal with incoherent set of ID's
You shouldn't treat IDs as numbers, as they're not intended to do math with. Keys should be informationless.
The Real WTF of your code is that if your IDs are not evenly distributed (for example because some records have been deleted) some records will pop up more frequently than others.
Instead of WHERE ID>='$r', use the row count to create a query that does a LIMIT xxxx,1 (where xxxx is a random number from 1..rowcount):
In pseudocode:
function randomquery($rowcount,$query)
{
$r=$rowcount*rand()+1;
return "$query limit $r,1";
}
// $rowcount=number of rows in table (select count(1) from friends)
$ask=randomquery("select * from friends",$rowcount);
This will return you a random row with equal probability for each record, regardless of their ID.
Also, it can be used on any table (or even joined tables) because it does not require any field names.
Admin
For those methods that use RAND to sort the rows, it is not possible for the optimiser to take care of it: it can't know which are the n lowest random values until it has generated them all and sorted them all. I suppose it could keep a list of bookmarks to the n lowest values generated so far, but that would be an optimisation for this specific query.
Admin
I'm going to bookmark this whole thread, as i am going to want to pull random things out of a DB, and i don't want it to be WTFy. Not that i know anything about Relational databases or anything, but i assume once i have the basics down that doing this sort of stuff (with the help here) will be almost trivial.
I just hope i can get the basics down. :-)
Admin
Look peeps the real problem here is that relational databases are not optimized to retrieve random data. The primary purposes of a relational database are to record, organize and report on data, to make it LESS random in a sense. The most random thing about relational databases is their widely varying implementations of SQL syntax.
Today's WTF is an example of a poor way to do this on just about any platform, and the suggested solution of ORDER BY RANDOM() may be worse in some situations. The "best" solution to this will most likely be DB platform and domain specific.
Been working with database driven business applications for nearly 20 years and have yet to encounter a situation where random data was desired to be returned. I have needed the occasional random number, but random data? Can anyone give an example where that is useful? Only thing I can think of is a "Quote of the Day" scenario where you are unlikely to have millions of rows . . . So while this is an interesting mental exercise I wonder about the WTF'ery of a business application where random data is desired.
Admin
Well I was totaly aware of the problem with anomalies in key distribution. However I was looking for fast solution, as the random thumbnails were supposed to be presentat at each webpage. Therefore I decided to sacrafice the quality of the solution :)
Your solution was already posted, and I have already answered to this approach - it's slow in MySQL 4.0. I don't know why (problably MySQL lacks some statistics in it's B-tree) but the time required to fetch N-th row using LIMIT N,1 is proportional to N (or at least strictly monotone:) )
Admin
To all you CHOPS who suggest that you SELECT COUNT(*) FROM BLA and then choose a random number between 0 and the result and the select are living/working in a dream world.
What about application servers that generate a seed for the unique id on every restart ?
What about seeding the PK to not start at 0 ?
Admin
Slightly off topic - Is this also a requirements / analysis WTF.
Too bad we don't have any context for why it needed to be random in the first place.Does it truly need to be random, or just perceived to be random? Is there also requirements that all records should be viewed over time. Should we replace the word 'random' with 'not the same every time'
I'm thinking of the case of banner ads or similar where the client just wants the content to change on subsequent requests. You really wouldn't want to be completely random because you can't guarantee that everything gets viewed and the quality of the random algorithm could bias your results.
You might also have contractual obligations to show content, and if random, you can't assure the business it will be shown (unless of course you can convince the boss that your website will still be running when the universe dies)