• {uuid-comment} (unregistered)

    On at least one database there's a good reason not to use the UUID type for IDs, even if the values are supposed to be UUIDs. They don't index well, even worse than using a string for them.

    The WTF here is really how that value got into the database at all.

  • (nodebb) in reply to {uuid-comment}

    The WTF here is really how that value got into the database at all.

    Most likely because the field is a varchar or similar, containing a stringified representation of the UUID rather than an actual UUID.

    Which is, of course, a fairly strong WTF.

  • (nodebb)

    Which is, of course, a fairly strong WTF.

    Yet, we had someone rationalize why they might do it in the very first post.

  • {uuid-comment} (unregistered) in reply to Steve_The_Cynic

    Most likely because the field is a varchar or similar

    Oh yeah definitely - but like I said there's a good reason to do that.

    Though actually, looking back at the ticket where we looked at this, I'm misremembering: the indexing was better using the UUID column type. I think we decided not to change to a genuine UUID type because we support multiple databases and the change was more pain than benefit (particularly as SQL Server is not a big part of our customer base), rather than because it wouldn't actually help.

    The WTF to me is the business logic (or the ability for people to run insert into ... on production systems?) that lets invalid values in.

  • Mark (unregistered)

    If only SQL had a way of indicating that a given field does not contain a value... maybe someday we won't need strings like "Releases does not have a UUID" to indicate such things.

  • Raj (unregistered) in reply to {uuid-comment}

    UUIDs have pros and cons. They're great for uniqueness, but in SQL Server the clustered index determines the physical order of the table, so if it happens to be a UUID (well, a GUID really) then each insert will go to a random place, and if there's not enough free space on each page it will cause lots of I/O.

    You can try to cheat by using somewhat sequential GUIDs, but it's still a big data type and searches will not be as fast as with integers.

    On the other hand, if you have a really, really busy system that inserts a ton of records, a GUID with a low fill ratio on your pages is a great design, because unlike a sequence there won't be an I/O hotspot at the tail of the table.

    Regardless, I tend to use this type for unique IDs because it makes it easier to generate keys on the client side (e.g. Entity Framework) and the uniqueness is hard to beat, especially when the data moves around.
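
    A minimal sketch of that pattern (table and column names here are invented for illustration):

        -- T-SQL: a GUID primary key whose value can be supplied by the client
        CREATE TABLE dbo.Shipments (
            ShipmentID uniqueidentifier NOT NULL
                CONSTRAINT PK_Shipments PRIMARY KEY,
            Payload    nvarchar(200) NULL
        );

        -- the key can be generated client-side (e.g. Guid.NewGuid() in .NET)
        -- before the row ever reaches the server, or server-side with NEWID():
        INSERT INTO dbo.Shipments (ShipmentID, Payload)
        VALUES (NEWID(), N'server-generated key');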

  • sizer99 (google) in reply to Mark

    While null would have been a better choice, I bet the code would have crashed on that too. Coders tend to forget that nullable values can be null.

  • Naomi (unregistered)

    IANA DBA, but if MODULE_UUID was declared DISTINCT, then making it nullable would only permit one null in the entire column, wouldn't it?

  • (nodebb) in reply to Naomi

    Microsoft SQL, at least, lets you do e.g. CREATE UNIQUE INDEX X ON Y (Z) WHERE Z IS NOT NULL and then you can have any number of rows where Z is null, but all the non-null values of Z have to be unique.
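
    For example, a runnable sketch (hypothetical table, reusing the names above):

        CREATE TABLE dbo.Y (Z uniqueidentifier NULL);
        CREATE UNIQUE INDEX X ON dbo.Y (Z) WHERE Z IS NOT NULL;

        INSERT INTO dbo.Y (Z) VALUES (NULL), (NULL);  -- both succeed: NULLs aren't in the index
        INSERT INTO dbo.Y (Z) VALUES (NEWID());       -- succeeds
        -- inserting the same non-NULL value twice would violate index X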

  • Sole Purpose Of Visit (unregistered) in reply to Raj

    Since a GUID is basically 128 bits, Raj, I refute your post. You'd have to show me an awfully convincing use case where two 64-bit integers are inefficient as an index.

    Have you met Mr Claude Shannon, btw? You should. He's a really great guy. Plus which, he can explain this ridiculously simple concept to you better than I can.

  • Sole Purpose Of Visit (unregistered) in reply to Raj

    "Somewhat sequential GUIDs?"

    You win the WTF Internets today, Raj.

  • (nodebb)

    Paul Leach is everywhere. Except where he's not.

  • Raj (unregistered) in reply to Sole Purpose Of Visit

    Feel free to RTFM and learn a thing or two.

    https://docs.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-ver15
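
    For what it's worth, NEWSEQUENTIALID() can only appear in a DEFAULT constraint, so the usual pattern looks something like this (names invented):

        CREATE TABLE dbo.Orders (
            OrderID uniqueidentifier NOT NULL
                CONSTRAINT DF_Orders_OrderID DEFAULT NEWSEQUENTIALID(),
            CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
        );
        -- each generated GUID sorts after the previous one on the same machine,
        -- so inserts land at the tail of the clustered index instead of on random pages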

  • (nodebb) in reply to Raj

    Feel free to RTFM and learn a thing or two.

    TBH, the whole "guid makes a horrible clustered index" thing is really a very obscure problem that most people will never have.

    First of all, putting a clustered index on a unique-ish column is pretty non-standard. Usually the best choice for a clustered index is something that's either a common value among a related group of records (e.g. a foreign key), or something that's range-selected or sorted. If none of these apply to a particular table and you mostly select a single record at a time, then it might make sense to put a clustered index on a unique column... but still probably not.

    Your other choice is to have no clustered index and leave the table a heap. For single-record queries, a nonclustered index lookup followed by a bookmark lookup on the heap will read at most one more page than a clustered index would. Sometimes it will be the same number of page reads, or even fewer, if the index key is significantly smaller than the row and the nonclustered index is therefore denser.

    About the only real difference you can make with a clustered index on a unique column is to improve performance by using a guid and taking advantage of the intrinsic page distribution of the random key. In other words, guids almost never hurt indexing and sometimes help.
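
    To make the heap option concrete, a sketch (schema guessed from the article):

        CREATE TABLE dbo.MODULES (
            MODULE_ID   int IDENTITY NOT NULL,
            MODULE_UUID uniqueidentifier NOT NULL,
            MODULE_NAME nvarchar(100) NOT NULL
        );  -- no clustered index, so the table stays a heap

        CREATE UNIQUE NONCLUSTERED INDEX UX_MODULES_UUID
            ON dbo.MODULES (MODULE_UUID);
        -- a single-record query is an index seek plus at most one RID lookup into the heap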

  • (nodebb)

    One WTF is using 3 different columns as unique key/row identifiers. Another WTF is assigning [autoincrementing] integers to unique module names (... to increase their uniqueness? ... to keep track of the order in which they were installed?? ... to make sure one can install two instances of the same MODULE_NAME???). But TRWTF is using UUIDs as identifiers.

  • AnonDBA (unregistered) in reply to Jaime

    No dude. The clustered index should be the primary key if at all possible.

    Clustering a non-unique field just makes SQL Server add a hidden ‘uniquifier’ column to the index. It takes more space, and impacts performance on all other indexes on the table.

  • (nodebb) in reply to AnonDBA

    No dude. The clustered index should be the primary key if at all possible.

    I hear this all the time, and it has merits ....but....

    You only get one clustered index. It is often more beneficial to spend that one clustered index on speeding up the parts of the system that operate on large amounts of data at the same time. Adding a clustered index to the InvoiceLines table on InvoiceID will make a lot of aggregate queries go from reading 100 pages to reading 3, while shortening the clustering key will make a point query go from reading 3 pages to reading 2.
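
    In DDL terms, the trade-off looks like this (table and column names assumed):

        -- non-unique clustered index: all of an invoice's lines end up physically together
        CREATE CLUSTERED INDEX CIX_InvoiceLines_InvoiceID
            ON dbo.InvoiceLines (InvoiceID);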

    I usually don't care about disk space. The most expensive part of a SQL Server is the licenses; the disk subsystem is like the third or fourth biggest item in the budget.

    Adding a clustered index on a constantly increasing key will reduce table fragmentation, but I don't care much about that either. The default in SQL Server is to leave space in leaf-level pages for two rows. A pretty common number of rows per page is 100, so that means I might be able to add 2% to a page without splitting. If you have randomly distributed inserts, you won't be adding more than 2% in the long run. Here's the math:

    Let's pick random numbers; it doesn't matter what they are, so feel free to change them. It will only affect the stabilization time, not the end result.

    Starting table size: 10,000,000 rows
    New rows per day: 1,000,000

    Day one, you add 10% to the table, you get splits, crap. Day two, you add another million rows, but that's only 9.09% of the table size, since the table is now bigger and you rebuilt it last night. Continue this until day 41, and now you are only adding 2% of the table size every day, so the default fill factor prevents page splits.

    See, the very problem you are trying to fix will fix itself. If you are adding a lot of rows, then the table will quickly get large enough that the rows you add are a drop in the bucket. This doesn't apply to situations where the data grows exponentially - but this is a special case for IPO unicorns.
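
    The arithmetic can be checked with a quick loop (same numbers as above):

        DECLARE @table_rows float = 10000000,  -- starting table size
                @new_rows   float = 1000000,   -- inserts per day
                @day        int   = 0;
        WHILE @new_rows / @table_rows > 0.02   -- 2% headroom claimed for the default fill
        BEGIN
            SET @table_rows += @new_rows;      -- nightly rebuild absorbs the day's inserts
            SET @day += 1;
        END;
        SELECT @day + 1 AS FirstSafeDay;       -- 41: from then on, daily growth fits the headroom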

  • AnonDBA (unregistered) in reply to Jaime

    You’d rather cluster on the InvoiceID and a uniquifier, rather than InvoiceID and InvoiceLine? This doesn’t make a whole lot of sense. Just add both to a clustered primary key.

    It sounds like you’re going a long way to justify not doing it the right way. Your nonclustered indexes provide the speed for any other query conditions - searching by customerID, ShipDate, etc.
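
    I.e., something like this (names assumed):

        ALTER TABLE dbo.InvoiceLines
            ADD CONSTRAINT PK_InvoiceLines
            PRIMARY KEY CLUSTERED (InvoiceID, InvoiceLineID);  -- unique, so no uniquifier

        -- other access paths get their own nonclustered indexes
        CREATE NONCLUSTERED INDEX IX_InvoiceLines_ShipDate
            ON dbo.InvoiceLines (ShipDate);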

  • (nodebb)

    You’d rather cluster on the InvoiceID and a uniquifier, rather than InvoiceID and InvoiceLine?

    Yes, this doesn't make sense. InvoiceLine is a table name and you are adding it as an index key.

    You seem to be very hung up on the idea that the pointers in the nonclustered indexes will have an extra uniquifier, making the leaf-level pages less dense. I am not disagreeing that this happens... but I did show some math on why your insertion page-split claim doesn't hold water. I can do similar for this claim too.

    I'm not "going a long way to justify not doing it the right way", I'm trying to have a real discussion on what is the right way. You are simply making a statement and saying that people who disagree with you are wrong.

    Bookmark lookups are expensive. Choosing a clustered index that supports a range search or a large number of children for a parent can eliminate millions of bookmark lookups per day. That savings is often greater than the cost of the things you mentioned.

  • AnonDBA (unregistered) in reply to Jaime

    Ok, you mentioned InvoiceLines - presumably a table containing columns for both InvoiceID and the InvoiceLine (ID?)

    " Adding a clustered index to the InvoiceLines table on InvoiceID will make a lot of aggregate queries go from reading 100 pages to reading 3. While shortenting the clustering key will make a point query go from reading 3 pages to reading 2."

    You appear to be making the point that InvoiceLines will perform better using a non-unique key, rather than a combination of InvoiceID and InvoiceLine - which, presumably is unique in that table.

    Also, I never made a claim concerning page splits.

    "AnonDBA (unregistered) in reply to Jaime 2020-08-04 Reply

    No dude. The clustered index should be the primary key if at all possible.

    Clustering a non-unique field just makes SQL Server create a hidden ‘uniquifier’ column to the index. Takes more space, and impacts performance on all other indexes on the table."

    I'm simply pointing out that a unique clustered key is far more efficient than using uniquifiers. If you need an aggregate, you create a non-clustered index on the aggregate key (InvoiceID).

  • (nodebb) in reply to AnonDBA

    I'm simply pointing out that a unique clustered key is far more efficient than using uniquifiers.

    ... if your definition of efficient is solely about space usage. SQL Server didn't always create uniquifiers. In the "old days", SQL Server used the physical on-disk location. The problem with doing this was that any update that moved a record (a page split, or a move due to a size increase) would have to touch all nonclustered indexes. This was fixed by having nonclustered indexes reach the data via the clustered index key - which necessitated the uniquifier. SQL-generated uniquifiers aren't necessarily worse than user-specified uniqueness, but they may take a bit more disk space.

    If you need an aggregate, you create a non-clustered index on the aggregate key (InvoiceID).

    Yes, and if your query returns 100 records, those records are probably on 99 pages and you have to go read them all. If a query that returns a large number of records uses a clustered index, it will almost certainly read a single-digit number of pages. This is the biggest boon you can get from a clustered index.
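
    For example, a query of the kind described (column names assumed):

        -- with the clustered index on InvoiceID, all of an invoice's lines
        -- sit on a handful of adjacent pages; no bookmark lookups needed
        SELECT InvoiceID, SUM(LineTotal) AS InvoiceTotal
        FROM dbo.InvoiceLines
        WHERE InvoiceID = 12345
        GROUP BY InvoiceID;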

    Only one index can be the clustered index. It either reduces disk space usage and makes nonclustered indexes smaller, or it makes a subset of your high-cardinality queries blazingly fast, but not both. The second is very often the better choice. You keep saying that I'm ignoring the first - I am not - I am telling you that the second is usually more effective.

  • (nodebb) in reply to Naomi

    “if MODULE_UUID was declared DISTINCT, then making it nullable would only permit one null in the entire column“ - depends on the RDBMS. With PostgreSQL, for example, NULL is not equal to itself and is therefore always treated as a distinct value, so a unique index can handle more than one row with NULLs.

    Addendum 2020-08-09 12:29: (I know this is about MS SQL Server, but I have no experience with it)
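
    A quick PostgreSQL illustration (hypothetical table):

        CREATE TABLE modules (module_uuid uuid UNIQUE);
        INSERT INTO modules VALUES (NULL), (NULL), (NULL);  -- all succeed: NULL is distinct from NULL
        INSERT INTO modules VALUES ('00000000-0000-0000-0000-000000000000');  -- succeeds
        INSERT INTO modules VALUES ('00000000-0000-0000-0000-000000000000');  -- fails: duplicate key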

  • Mirality (unregistered)

    I've worked with systems where these sorts of multiple ids are actually reasonable. The UUID is the "real" unique id in the app's object model, but when this is saved in the database it gets an extra integer id, because it's usually more efficient to do table joins on an integer id than on a UUID, and the data is laid out better on disk that way as well. Often the integer id gets almost completely hidden by an ORM.
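
    The pattern usually looks something like this (names invented):

        CREATE TABLE dbo.Widgets (
            WidgetID   int IDENTITY NOT NULL
                CONSTRAINT PK_Widgets PRIMARY KEY,         -- cheap internal join key
            WidgetUUID uniqueidentifier NOT NULL UNIQUE    -- the app's "real" id
        );

        CREATE TABLE dbo.WidgetParts (
            WidgetPartID int IDENTITY NOT NULL
                CONSTRAINT PK_WidgetParts PRIMARY KEY,
            WidgetID     int NOT NULL
                REFERENCES dbo.Widgets (WidgetID)          -- joins stay on the integer
        );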

    Meanwhile, the "name" and "desc" fields aren't really ids, and their always being equal to each other is possibly more of a data-entry problem than an actual design issue. Although even there it's not necessarily silly -- perhaps you could use one as an internal name for logging etc. while the other serves as a customer-visible name or something like that, so they wouldn't always be identical. (Supporting translated names is another case where this sort of apparent duplication crops up in the initial dataset, but I don't think that's the case here, because you usually need to support more than one language, so you'd need a different structure for that.)

    Using a varchar field for a uuid -- and one where the value has obviously come from human entry without any kind of validation, to boot -- now that is definitely a WTF. (Although not one without precedent, as anyone working with a JSON API can attest.)
