• jj (unregistered)

    K.P. adds, You may notice the use of GUIDs. The resident developers were nearly moved to tears at the news that we would be eliminating GUIDs as row identifiers. Apparently a former “guru” had informed them that numeric row Ids are for row counting, and it is safest to use GUIDs for each PK, “just in case we ever have to use replication.”

    As a developer, I would be moved to tears by this comment as well. How does the customer benefit from converting to identity columns from guids?

    The last time that I checked, SQL Server still supports DBCC Checkident, which is used to re-sync identity columns.

  • (cs)

    I wonder if this is a certain well known health provider in financial difficulties whose name begins with 'i', about to be bought by another health provider whose name begins with 'I'.

    Let's just say that every primary and foreign key (in a certain enterprise application that they still haven't finished in 6 years and never will) is a GUID.

  • exaxe (unregistered) in reply to jj

    Re guids vs autoincrement. I think about it in terms of how the index will be built. With binary search, an autoincrementing id will always add the row reference to the end of the index. If the index grows, sql server can simoly add a page to the index file and keep adding references.

    New rows with a guid pk will result almost always with the row reference somewhere in the middle of the index. This will sometimes involve some memory reshuffling that an autoincrementing index will not have to do. I think guids are best used for 'cold' tables, where replication is more important than insert time. However for tables with a lot of inserts will benefit much more from an autoincrement. There are also risks of using guids when generated by the client. Guid collisions could occur if the guid generating algorithm is not a 'standard' generation algorithm. We use a wierd object db which generates a 'guid' which actually increments. I don't know whatg they were thinking but we definately cannot mix these 'guids' with .net generated guids (the chance for a guid numberspace collission is too high)

  • Pete (unregistered)

    As some have already stated, using GUIDs as the identifier of the row is not a boneheaded play. First of all, you can know the value of the key of the row you are inserting BEFORE you insert the row (rather than inserting and then getting the @@identity - in SQL Server, anyway).

    Second of all, GUIDs are a good step to eliminating bad joins. You simply cannot join a wrong table (and get results) if you use GUIDs. You can however, really get a query you think is working if you are using the same identifiers (integers) all over your database.

    GUIDs also allow you to keep the same identifiers in multiple environments.

    The cursor was stupid, but there are more than legitimate reasons to use a GUID on a row.

  • (cs)

    Dear, dear me. All these comments in favour of GUIDs, as though the only DBMS in the world was MS SQL Server. Someone even referenced an article by well known knucklehead Jeff "VB Is Da Bomb" Atwood. As the saner comments to his article pointed out, GUIDs are a stupid choice for a primary key as they fuck up indexes and arent guaranteed to be unique (something that becomes statistically more likely with the UUID schemes and generators in common use). As for arguing they are necessary for replication, that's only true if you're using a technologically dodgy DBMS like SQL Server in the first place - a DBMS with decent replication hides any requirement such as GUIDs under the hood.

  • iMalc (unregistered)

    I'd just like to add:

    YAGNI!

    Using something "just in case we ever ..." is no reason to use it.

  • (cs) in reply to Lionstone
    Lionstone:
    You're right. You're no expert. A simple index defragment or reindex (depending on what SHOWCONTIG showed you) would have given you the same speed increase for about 10 minutes worth of the work. Most of the increase came not from switching data types, but from the fact that by switching data types, the indexes were rebuilt.

    You have to know what you're using and know how to use it. Just saying "use A all the time!" and blindly following some database design "recipe" will inevitably lead you to make some pretty bad errors.

    My point exactly. Not knowing how to use it led me to the conclusion of "I don't know what I did wrong, so if you are in the same problem just stick to what you know". Granted, my post may have looked a bit "know-it-all", but that was not the idea.

    And by the way, I'm not saying I switched datatypes and it worked. I'm saying that, using two empty databases with schemas differing only by the datatype of the PKs and running the same kind of import, I got a much more efficient process with numerical IDs.

  • Gnudiff (unregistered)

    Amusing.

    I have had some experience with Oracle, Postgres and MySQL for years, and this is the first time I read about GUIDs... and not only people are debating their merits, but seem to think they are somehow pretty relevant to db design in general.

  • Anon (unregistered)

    There are many arguments in favour of using GUIDs in your database. There are many arguments againt using GUIDs in your database.

    But either way, if it's causing you to break down and cry at work, the whole thing is an excellent argument in favour of seeing a psychiatrist for help with your obvious mental disturbance.

    And if that's the biggest problem you have, then your WTF level is so low that you don't deserve to complain, anyway.

  • Z (unregistered)

    Had a lot of this in my previous company... authors learned sql on some prehistoric version of Oracle and perverted it to extremes...

    There should be something in the tools to check for developer iq :)

  • Guy Geens (unregistered) in reply to Sean
    Sean:
    Using an autoincrementing integer can sometimes lead to tables being locked for an unnecessarily long amount of time -- if you generate a new ID, you have to lock the entire table until the INSERT transaction has completed.

    AFAIK, most databases use a caching system for autonumbers. The key generator only locks around a very small piece of code (essentially returning SEQUENCE++).

    Most of the time, the sequence value is stored on disk as well.

  • Guy Geens (unregistered) in reply to Timberwolf
    Timberwolf:
    Blimey. I've seen this before at the SQL end of our shop. Similar thing, huge long cursory mess that refactors out to 'INSERT INTO a SELECT column FROM b'.

    Try this:

    • Select a set of records into Java objects
    • Work on the data, transforming it into a new set of Java objects
    • Insert the new objects into the database

    All of this was refactored into a single INSERT ... SELECT statement. Running time reduced from minutes to seconds.

  • (cs)

    re: the int vs GUID debate.

    too many times have i worked on db's with int as pk's - and too many times have i seen tables with "magic" pks, like -1 and 0 as some hack shortcut for something. i'd rather use guids and force people to do stuff properly than sort out their mess when they leave

  • anon (unregistered) in reply to Pete
    Pete:
    As some have already stated, using GUIDs as the identifier of the row is not a boneheaded play. First of all, you can know the value of the key of the row you are inserting BEFORE you insert the row (rather than inserting and then getting the @@identity - in SQL Server, anyway).

    Second of all, GUIDs are a good step to eliminating bad joins. You simply cannot join a wrong table (and get results) if you use GUIDs. You can however, really get a query you think is working if you are using the same identifiers (integers) all over your database.

    GUIDs also allow you to keep the same identifiers in multiple environments.

    The cursor was stupid, but there are more than legitimate reasons to use a GUID on a row.

    We have a winner!!!!!!!!!!!!!!!

  • bd (unregistered) in reply to chrismcb

    Most artificial keys become natural ones in due time anyway... (important: "become", not "are replaced with").

  • (cs)

    I take it this guy was on a "get paid by the line" metric?

    I suppose it looks more impressive to show to your boss, and they guy might be onto something here.

    I might spend ages figuring out a neat set-based solution that results in minimal lines of code. When my boss sees this, he might think that he's paid me for a day to write 10 lines of code. Instead, I could thoughtlessly blaze through it in a nasty pointless loop, write 50 lines of crap code and have far more to show him!

    If it runs slow? Hardware not good enough!

  • Jasmine (unregistered)

    I just got out of a (VB-based, yeek) company where all the procs were written that way. I don't care what any of y'all say... there is no reason for that. None.

    This is why VB programmers shouldn't be allowed near the databases.

  • Tinkerghost (unregistered)

    I have a table with about 16M rows in it on average - 3 months of data - normally it's searched by customer ID - nicely indexed & fast. However, I have one guy who needs to do his searches on a date field in the data - which is ungodly slow. My obvious solution was to create an index on the date field - not only did it not help the date search, it slowed down all of the other searches as well. I also found that some things just aren't worth the time - the count() function in postgres is almost useless on large tables - 10 - 20X longer than just selecting all the id's for the WHERE clause & counting the records. It seems count() does a row scan of the table & ignores the indexing. Also sql optimization is great - until it isn't. I spent a couple of days tweaking the order of a statement & pulled in about a 20% decrease in the run time. The live server's data showed a statistical shift in the data vs the old dev data that no only made my changes useless, they were detrimental.

  • slamb (unregistered) in reply to LizardKing
    LizardKing:
    As the saner comments to his article pointed out, GUIDs are a stupid choice for a primary key as they fuck up indexes and arent guaranteed to be unique (something that becomes statistically more likely with the UUID schemes and generators in common use).

    I see complaints about non-uniqueness of UUIDs often, and I just don't get it. Probabilities as low as the ones we are talking about are so close to 0 that they are insignificant compared to other sources of error which, though harder to pin an exact probability on, are way more likely.

    Enough talk. Time for math. If you have one billion records which must have distinct 128-bit randomly-generated numbers, the probability of failure (according to the wikipedia article) is roughly 1-e(-10^9/2^129). That's 0.000000000000000000000000000001. That's way less likely than the probability of a conflict with an auto-incrementing key - there a single flipped bit will likely cause a conflict. But of course, bit flips are prevented by error correcting algorithms in storage and transmission. Those algorithms provide statistical guarantees as well...and if you do the math, you'll see that those statistical guarantees aren't as good as a 128-bit random number...but still good enough.

    Realistically, the only way either method will fail is if someone fucks up - picks a bad randomization algorithm, makes a bad cut'n'paste on a manual database change, uses sequences in a broken way in the presence of concurrency, etc. Pick the method where the probability of fuck-up is lower, and don't worry about other sources of error which are less likely by many orders of magnitude.

  • Matt Burgess (unregistered) in reply to slamb

    Another point people forget with regard to collision is that there is a difference between a collision and a MEANINGFUL collision.

    If you have 10 tables and you're joining on keys in them, the chances that you'll have duplicate GUIDs are astronomically low. The chances that you'll have duplicate keys in relevant places where a conflict will result... is lower still.

    That being said, I've worked in places where a million keys a day are created. Even a low chance when repeated often enough becomes a certainty.

    Still, collision is not really that big a risk. Personally I use auto-incremented ints. I just prefer them. Smaller. Neater. Faster. :)

    One of our senior programmers (not my supervisor) tried to convince me that we should use GUIDs for my project because they make it easier to replicate, merge, etc. Great, but... we're never going to do that. Why reduce performance constantly to optimise for an occurrence that will never happen?

  • slamb (unregistered) in reply to Matt Burgess
    Matt Burgess:
    That being said, I've worked in places where a million keys a day are created. Even a low chance when repeated often enough becomes a certainty.

    No, that last statement is not generally true. For independent events with p < 1, there is no finite number of trials which will result in a certainty of it occurring even once.

    By the Pigeonhole Principle (where events are not independent), a conflict becomes a certainty when you need 2^128+1 distinct UUIDs. [*] That's an unimaginably large number.

    As for what's probable...if you generate a million keys a day (as you said) for a million days, a collision is still inconceivable. With scale=10000, bc gives me a negative number for that approximation of the probability. It's obviously greater than with 10^9, yet it's small enough that bc can't produce an accurate result. I'd need to be more clever to give a precise answer, and it's far too late tonight to be clever. Nevertheless, rest assured that you're still a long, long way from this occurrence being worthy of consideration in your design.

    [*] - okay, there are a certain number of fixed bits designating the "random" type of UUID and whatnot...offhand, say it's actually 2^124+1. Near enough as makes no difference.

  • SQL Server Ben (unregistered)

    Although unlikely, there is one possible explanation for using cursors instead of set-based SQL: higher concurrency.

    Suppose you were updating 100,000 rows with this code (assume the GUID was a foreign key back to some parent table, and not the PK) The original logic would commit after each update, so it would gradually "crawl" through all 100k rows without blocking them.

    The set based SQL would require blocking all 100k rows for the duration of the update.

    We use cursors for high concurrency operations in production environments that are sensitive to long locks.

  • (cs)

    Hrm, those complaining about the loss of GUIDs, obviously haven't had to actually... USE them. They're fairly nasty to generate (you need a unique hardware ID (usually Ethernet MAC), a timestamp, and a couple other pieces of information before you can make a GUID).

    Which ends up in a horrible mess if you actually have to find a GUID in a text file, or look for a GUID in a log file. Sure they're all unique, but comparing all those characters gets old, fast (in the end, I find people just comparing the first 4 and last 4 hex digits of a GUID). Just a PITA to type, a PITA to search, and a PITA to compare...

  • slamb (unregistered) in reply to Worf
    Worf:
    Hrm, those complaining about the loss of GUIDs, obviously haven't had to actually... USE them. They're fairly nasty to generate (you need a unique hardware ID (usually Ethernet MAC), a timestamp, and a couple other pieces of information before you can make a GUID).
    Wrong. Read RFC 4122, section 4.4 for an easier way.
    Worf:
    Which ends up in a horrible mess if you actually have to find a GUID in a text file, or look for a GUID in a log file. Sure they're all unique, but comparing all those characters gets old, fast (in the end, I find people just comparing the first 4 and last 4 hex digits of a GUID). Just a PITA to type, a PITA to search, and a PITA to compare...
    Don't manually type or compare them. Cut'n'paste and search result highlighting are your friends.
  • (cs) in reply to Worf
    They're fairly nasty to generate (you need a unique hardware ID (usually Ethernet MAC), a timestamp, and a couple other pieces of information before you can make a GUID).
    Please explain how any of those (non)requirements make it any more complicated to call a function. I'm really curious. Are you somehow implying that it's difficult to get a unique hardware ID or timestamp?
    Just a PITA to type, a PITA to search, and a PITA to compare...
    You know what's also a pain to type: the Constitution. How often do you need to type the Constitution? Never. How often do you need to type a GUID? Never. They're generated with function calls and no sane person would actually type them out when there's copy/paste.

    I find it really hard to believe that it would be any harder to search for a GUID with a known and consistent format than it would be for an arbitrary integer in a log file full of dates, times, error codes, and all sorts of other arbitrary integers. I could be wrong, but I think regex is pretty good at finding strings of a known format (and a heck of a lot better at that than at determining what type of data an integer is supposed to represent).

    As for comparisons, I like that just comparing 8 hex digits of a GUID is every bit as accurate as comparing all 8 hex digits of an int. Those other digits you're not comparing? Think of those as collisions where the same int is being assigned to different rows in your database because you ran out of unique integers. Oh, and if you can't just create/use something to compare arbitrary length strings (you know, like those functions built into every worthwhile programming language/command shell) then you probably shouldn't be worrying about things like GUIDs anyway.

  • (cs) in reply to Z
    Z:
    There should be something in the tools to check for developer iq :)
    I hear that some companies use brain teasers during interviews to see how smart people are. Anyone have any opinions on whether this is good or bad? :)
  • Garrett Fitzgerald (unregistered) in reply to damncrackmonkey
    damncrackmonkey:
    I find it really hard to believe that it would be any harder to search for a GUID with a known and consistent format than it would be for an arbitrary integer in a log file full of dates, times, error codes, and all sorts of other arbitrary integers.
    Actually, I can see one place where ints win over GUIDs for searching: you can search for a contiguous sequence of ints with BETWEEN. That's a tad harder with GUIDs.
  • BFair123 (unregistered) in reply to Lisae

    You're joking, right? When in the world would you have a user provide the unique identifier for a row? Are you going to have a textbox on a form where they type it in? Seriously!?!? Anyhow, I work with an application that uses GUIDs for the PKs and its an absolute nightmare. Think of this, if your PK is a clustered index the PK records are physically stored in sorted order. So, when you randomly create a GUID it will get stuffed somewhere in your PK index, which means things may need to get moved around to make room for it. Because of this we've had to turn many PK's to be non-clustered indexes on tables with high activity. Now, if you'd just have the fore-thought to use an identity column that auto increments your PK record would go at the end of the index.

Leave a comment on “An Amusing Way of Saying UPDATE”

Log In or post as a guest

Replying to comment #:

« Return to Article