- Feature Articles
- CodeSOD
- Error'd
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
1030 possible values? And he's already used up the frist one in the Easy Reader Version?
Admin
In the source With 1030 possible values.
Admin
There used to be at least one well established, yet buggy uuid (or was it guid?) implementation, which was prone to producing duplicate values in highly-parralel/virtualzed setups.
Admin
TRWTF is the being stripped of its vertical-align property so 10^30 shows up as 1030.
Admin
It's possible, I suppose, that Remy meant
1.0e+30
as an approximation, but expressing it as "X to the Y power" using superscripts got mangled by the CMS's formatting madness.Admin
Yep, the RSS text is fine.
Admin
Bonus reading: Raymond Chen on GUIDs, and why using only half of them is a bad idea:
https://devblogs.microsoft.com/oldnewthing/20080627-00/?p=21823
Maybe Krysk's cow-orker read that article, misunderstood it, and panicked ...
Admin
So if there's a collision, we'll just try again and not check if there is a collision.
Admin
Even if the programmer wanted to test for a rare GUID collision, then why not a double collision, too.
Truly expert programmers worried about a collision would use an obscure structure like a "wile loop" of even a "do while loop" to check keys until you got a unique one. The code would have been less complex (in the do while case) because you wouldn't have to duplicate the generation and assignment.
Admin
Fun fact: when we last did our redesign all those years back, the person who did the stylesheet rules was of the mind that you first clear out ALL the default CSS rules, and then put back the ones you're going to use. That was a trendy practice for like, a week, but yes, it's stupid and dumb. I keep meaning to fix it, but I keep not getting around to it, and usually just plop the relevant rules into the article stylesheet. I forgot
sup
was one of the broken ones.Admin
Why do they need a GUID to start with? Usually, when someone wants a GUID is to avoid having to do a list lookup, in their case an incrementing value would work.
Admin
I bet it sounded just awesome in your scrum meeting
Admin
I think a lot of this comes down to programmers often are bad at understanding risk. In this case GUIDs are really supposed to be unique. Provided you are not on some strange embedded platform using Bob's Ultimate UUID generator, libbobuuid or something you should probably trust they are. Chances they are not are likely a lot lower than your checking code creating some terrible performance or resource consumption problem attempting to verify their uniqueness are probably much greater than any possible collision.
Admin
The biggest WTF here is the general misunderstanding of probability which makes people - Remy included - radically mis-estimate the chances of a collision. It's analogous to the birthday problem, so you need to look at the number of pairs - which increases exponentially.
UUIDs are only likely to be unique for datasets with reasonably small numbers of entries. It's still uncommon, but not entirely unheard-of, for people to work with tens or hundreds of billions of data points, and at that point uuid isn't a large enough space.
Admin
@Foxglove: this is explored in the Wikipedia article. https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
You would literally need billions of billions (specifically, 2.71 billions of billions) of UUIDs for a 50% probability. "Tens or hundreds of billions" is… exceedingly unlikely to trigger this issue.
Admin
Chucker>
You would need billions of billions to have a 50% chance of hitting a collision with a single new entry (if WP is correct, I haven't checked). But if you have a dataset with billions of objects, you aren't adding a single new entry, you're adding large numbers. Even billions of datapoints gets you way down from 'not until after the heat-death of the universe', to 'not all that unlikely'. Tens of billions is enough to get you into 'almost certain to happen at some point'.
For our purposes, the number of pairs in a set is n^2. 10 billion is 10^10, squared is 10^20 pairs. 10^30/10^20=10^10. Which, as we've just said is 10 billion. So your chances of a collision when adding a single new value are only one in ten billion. Add a billion new values, it's 1 in 10.
As I said, people are really bad at estimating this kind of stuff. By the time you reach datasets with 10 billion points, you don't have to increase by an order of magnitude, just a doubling, to be almost sure to get a collision.
Admin
An if's no good, you need a while loop.
Admin
@Dave: No, to all of this. In order to have a 50% chance that a single new entry causes a collision, you’d need 2.65 undecillion existing entries. In order to have a 50% chance that any two entries collide, you’d need 2.71 quintillion entries.
Admin
There's also sequential UUID's which some systems use for various reasons.
Admin
No, it's a 50% chance for a collision within all those numbers, not for a single one. So your calculation that follows is wrong.
Admin
Odds on that at some time in the past there was some dreadful home-brewed "random" id generator and it's been swapped for the library call and the unrefactored code & comments have been left to fester.
Admin
TRRWTF is people bringing up the birthday problem when it's already been accounted for:
Emphasis mine, citation in original article.
Admin
I don't know if you're trolling or just confused, but if you start with 10 billion elements, and add another billion, you only have 11 billion elements. Those 11 billion elements form about 6e19 pairs, each with about 1e-30 chance of colliding, so your chance of having a collision anywhere is still less than 1 in 16 billion.
Admin
Indeed, people are bad at estimating this kind of stuff. Your comment, as well as Foxglove's, are excellent examples of this:
The number of pairs increases quadratically, not exponentially as Foxglove claims. (Unless Foxglove was just throwing around mathematical terms without meaning, which would be even worse.) That's a huge difference: With exponential growth, 2^100 is astronomical, whereas with quadratic growth, 100^2 is nothing.
As you (Dave) say, the number of pairs in a set is O(n^2) (not really n^2, more like n^2/2, so let's use O-notation to cover for this), so far so good. And as you say, "10^30/10^20=10^10. Which, as we've just said is 10 billion." But that's already the (inverse) probability of a collision among all of these pairs, not for a single new element.
That's where your mistake is -- basically you turn O(n^2) into O(n^3) when you say "Add a billion new values, it's 1 in 10." But you're not matching each new element with each pair of old elements, but just with each old element. So it's more like 10^9*10^10/10^30 = 10^(-11), i.e. one in 100 billion. Not for a single new element, but the total chance of collision for a billion new elements.
Admin
I was told that there would be no math.
Admin
Your comment makes me wish this forum had a thumbs up button. :-)
Admin
This kind of thing is why any true enterprisey application outsources the complex decisions of GUID creation to a GUID-as-a-service provider. Your GaaS provider can then worry about the probabilities of collision so you don't have to.
Admin
The problem with very low chances is that enough people / processes do them they are bound to happen. A wise man once said that a 1 in a million change is a certainty.
Admin
You might want these so there's never such a problem in the future:
⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉
Admin
"Easy Reader Version: Here, have a UUID. There's a finite amount, so don't waste it: 7014c167-ed9c-4f3e-a79a-326b84d75c8a"
Let me introduce you to http://www.wasteaguid.info/ :D
Admin
An unwise man quotes wise men out of context. Yes, 1 in a million is nothing to a computer. But we're talking about 10^30 which is 1 in a million million million million million.
Bitcoin hashing difficulty is currently ~10^23 if I got that right. The combined power of all current miners around the world would take some 200 years to get a 10^(-30) event.
Admin
This is 100% some sort of parallelization problem. If they are not seeding the RNG properly, different machines/threads could occasionally produce the same uuid4 if starting at the same time.
Admin
Eh, I can see why. The craptacular CMS at work uses UUIDs for links. It is supposed to resolve these when shipping pages, only sometimes it doesn't, because it's craptastic, which is how these links slowly seep through our pages. This would be ugly-but-working, because they resolve as 301 redirects.
Until… until somebody got a UUID collision. The resolution-on-shipping is broken and replaces everything that looks like a UUID, even when the link is not in the same site. To the user, it was "I try to put in a link to page-of-other-dept, but when I save, it always turns into a link to other-page-of-their-dept". It was easily fixed, fortunately, but yeah, "it's a one-in-a-million chance, but it… might… just… happen"
Admin
The chances of a UUID collision in a set of 103 trillion UUIDs is 1 in a billion. The chances of that collision causing a major production outage, forcing you to have to drop everything and go into work, at 3am, on your birthday, when you’d just been invited to join <your favourite celebrity> for drinks at an exclusive club... well, that’s practically a certainty!
Admin
Sequential UUIDs have s much smaller chance of any collision at the cost of having many collisions if you have one; the expected number of collisions stays unchanged.
Admin
If you have sets with billions of data points, you just use sequential GUIDs for each set. And suddenly one billion datasets of one billion points gives you less than a one in a billion chance of a collision.
Admin
Worse than that; we all got the same one so who knows how many collisions that has caused already.
Admin
Here's an analogy: I'm approaching a T-Junction in a vehicle of my choice. Regardless of which direction I want to turn, I check right and left to see if the way is clear before turning. Now I'm at a T-Junction with with a one-way flow of traffic. The take I get from these comments is that only need to check the direction from where the traffic is coming from. But I will still check both directions (personally I check right and left when my traffic light is green, joining a round-about etc).
The thing about a 1 in a googolplex chance is: the next one could be "it" regardless of it being the 1st or googolplex -1 time you do it. As a developer I am more concerned about not having to deal with the consequences (unforeseen or otherwise) of a crash (pun intended) than the speed at which I can turn the corner (mix-metaphor intended - for comic effect)
Admin
TRRRWTF is a technical website that doesn't have MathJax enabled.
Admin
The number of pairs increases quadratically, not exponentially. More to the point: the chance of a collision is less than Q^2 / 2N, where Q is the length of the queue and N the number of possible values. For uuid4, there are 121 random bits so set N = 2^121. Let's round hundreds of billions up to one trillion and set the queue length Q to 10^12 < 2^40. Then the probability of a collision in the queue is less than 2^80 / 2^122 = 1 in 2^42.
Admin
I don't like people equating "very small number" to 0, even though they look just about the same. Also, as programmers, should we not be more concerned with potential edge cases? In all of these "you're never gonna have a UUID collision" arguments, no-one seems to bring up what happens in the case of a collision. Most of the time, it's probably nothing e.g. whatever the user is trying to do fails, and they have to start again. No big deal. But what if your use-case is highly sensitive, and having a collision leads to unexpected and potentially catastrophic behaviour?
Not that I'm suggesting that is the case here. I just don't like the blanket "the probability is extremely low, so we can always act like it's zero" attitude.
Admin
So the obvious solution is to use 2 UUIDs to double the chance of not getting a collusion.
key = uuid4() + '-' + uuid4()
if(key in self.unloadQueue): # it probably couldn't possibly collide twice right? # right guys? :D key = uuid4() + '-' + uuid4()
self.unloadQueue[key] = unloaded
Admin
Bad analogy. A more apt analogy would be, before each junction, to stop and call the weather service (because a thunderstorm might suddenly be approaching and lightning hit you just that second), then call NASA (a satellite might have crashed and its debris hit you), call the Pentagon (nuclear war might just be starting), to be safe also call the Russians too, in the unlikely case you haven't been arrested by then call geologists (earth quakes, volcano eruptions), biologists (pandemic, well ...) etc.
Wrong-way drivers are a certainty (worth checking for), the kinds of events above are rather unlikely and most people don't bother to check for them regularly. But 1/googolplex is much much much much ... (sorry have to cut here because of character limit ...) much more unlikely yet than those. Your result getting falsified by alpha particles is much more likely, how are you going to protect against that?
BTW, I'm actually not a big fan of UUIDs, and whenever feasible, I prefer other kinds of IDs. But drawing bad analogies and (other commenters) throwing around wrong terms and overstating the birthday paradox (it's quadratic, not cubic, nor exponential) doesn't help make an informed decision.
Admin
With all of the crap implementations of random numbers that have been featured on this site, the fact that this group of people have so many people convinced that UUIDs aren't ever likely to collide is TRWTF.
If my random number generator came from XKCD, my chance of collision on my second UUID is 100%.
@Edd, @John Melville, @Prime Mover: congrats on understanding this.
@Lurk: We hope. What if they've named their homebrew uuid generator uuid4 and they're still using it?
@Anon: UUIDs are good when you have multiple nodes that could be generating new IDs, for example if you're a package delivery company and every office needs to generate IDs and you don't want them to need to close if communication breaks. But, yeah, I see them used a lot when a simple increment would work with less effort and (an infinitesimally small amount of) more reliability.
@pcooper: that's what you'd like to think. I've been in IT long enough to know how outsourcing really works: You pay someone more money to develop a solution than it would've cost to do stuff in house, and then when it inevitably breaks, you suffer the lost opportunity costs from having it be broken while the vendor acts befuddled and then eventually you have your employees fix it. This may not be true with all outsourcing, but it happens too much to claim that outsourcing really removes you from accountability.
@ivancho: That's one way, but there are others.
@WernerCD: With a bad random number generator, that can be just as liable to generate collisions.
Admin
Once you get to low enough probabilities things like "a cosmic ray flips a bit in ram, changing the result" become equally likely. Do you check for that?
Admin
If it's a crappy homebrewed random number generator it is not, by definition, a UUID.
If that bit of code represents someone taking an informed action based on mistrust of their UUID generator, then TRWTF is that they'd continue using it at all.
Yes there's some remote possibility of a collision, test for it if you want, and write a paper on it when it happens. If you're concerned about ropey edge cases crashing your software I'd be prepared to bet a few billion billion dollars there is lower hanging fruit in your codebase than this one.