- Feature Articles
- CodeSOD
- Error'd
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
There was some comment in the original WTF to the effect that at the point of ID generation, they could not know which IDs has already been generated, ergo the whole point of using a pseudo-random GUID generator in the first place.
So if I'm that developer, I either accept that the risk of GUID collision is small enough (Andy's retarded swap hack notwithstanding), or I go back to the stakeholders and point out that this requirement cannot be implemented as stated.
Admin
Seems I skimmed the opening part of Unique Requirement.
Even for a large series of users generating their own pools of IDs and submitting them in after they're done, the same logic still applies. If you can't check the information at the time of creation, you could likely modify the underlying implementation in a way that would be suitable. If a collision did occur, the logical way to handle that is through the correspondence between client and server. When the client uploads a data set that collides, create new ids for the client's associated information and update the client side data to reflect that change.
Mind you, given the relatively miniscule chance that such an occurrence would present itself, it should be an acceptable compromise for the client.
Admin
This article has already been submitted to and published by Visual Studio Magazine 2 or 3 years ago. (Submitted by this very web site.) Are we out of good WTFs completely for now?
Admin
You are indeed correct.
Since there was no source cited, is that a bad thing for this website or does it regularly not cite its sources?
Admin
Then again, the author that submitted it to the Visual Studio Magazine was Alex Papadimoulis.
Admin
Well, yeah, we can speculate all we want on the actual requirements and context, and depending on which variation we come up with, there are different sets of viable solutions.
To get the most enjoyment out of the WTF, I take the requirement and GUID solution at face value. I assume that they had to use a GUID, or something like it, because there's no way to detect a collision until it's too late--when it's either impossible or impractical to work backwards to the source, separate the results, regenerate, and try again. (And I mean "assume" in the sense of "let's make this one of our assumptions, for this iteration of the problem," as opposed to "I wholeheartedly believe that this is true.")
If this is the case, then your choices are bascially:
Determine that GUID is "sufficiently unique."
Come up with some other scheme that you can determine is "sufficiently unique".
Push back on the stakeholders and tell them, "Impossible as stated."
On the other hand, if it turns out that it's merely inconvenient to deal with a duplicate--say we don't find out until 3 steps down the line, but we CAN reliably detect dupes, and enough of the original request/input/whatever remains that "Reject, Regenerate, Retry" is possible--then we have a different landscape, and as you say, the rarity of such occurrences should mean the "RRR" approach is viable.
Admin
Well,
There's a number of things that aren't immediately obvious from the article.
They mention that these are the client's needs, but say nothing of the client's awareness of potential solutions. Regardless of the unique identifier employed, it will undoubtedly have the chance, scant as it might be, of duplicates. Being as the client isn't writing the software themselves (because either they're not programmers, or your group wrote some underlying dependency software that is closed source, and you're the guys to go to), the solution on handling duplications might not be obvious to the client; which is why they would put forth such a requirement in the first place.
There's really insufficient information to say whether recreating the identifiers, after they're created, is out of the question since their use outside of synchronization/submission is not explicitly stated. Given that the correspondence between the client software and the server back-end can handle the issue, short of the users writing the guid down and using that internally, the software should be easily able to propagate any necessary changes to the underlying data model. Odds are the only change would be the Guid, not the underlying data structure that represents it (since it's probably a client-side database of some sort?)
Admin
Admin
Actually, I take that back, let me rephrase:
Given that most systems store data fields with a fixed-length value; regardless of the method employed to generate the unique identifier, so long as said unique identifier is also fixed-length and cannot dynamically adjust based off of the needs of the scale of the system in place, duplicates will undoubtedly occur should the elements of the set representing the unique identifiers in use not be incremental in nature. Given an incremental system, versus random, the identifier is then limited to the finite set of elements capable of being represented in the size allowed.
Admin
Exactly. That was part of my point. :-)
Hence the potential for dialogue with the stakeholders I mentioned. We (the developers) go back to the client and tell him, "Look, based on your requirements, uniqueness CANNOT be guaranteed. However, we can make it extremely unlikely, such as blah blah million years blah blah," and present possible solutions based on the requirements as understood and on potential loosening of the requirements.
Of course, then we drastically reduce our chances of a WTF, and where's the fun in that?
That depends on how much of the article we take at face value. If there's truly no way to know anything at all about what IDs have been generated so far, as the article implies, then yes, there's no way we can guarantee that an ID hasn't been used already. All we can do is reduce the probability of a collision to an acceptable level.
If we go back to the requirements generators and try to squeeze more details out of them, we may find that there's some way to know for certain that a particular subset of valid IDs has never been used.
Yup. And why we need to make sure they understand what they're really getting, and what the limitations and alternatives are.
Yup. There is, however, sufficient information to determine that the "swap the first two digits" hack is a total WTF, and to take childish amusement from it.
Admin
Not quite. It will only "undoubtedly" occur if we generate at least N + 1 identifiers. It will occur with probability described by the birthday paradox up to that point.
This is true of the random generation case as well. The pigeonhole principle applies in both cases. It's just that with the sequential case, we always know, from a single datum, exactly which IDs have been used and which haven't, and we can guarantee that the first N IDs generated will be unique.
Admin
By undoubtedly, I was referring to a use case that required an indeterminate amount of time, up to, but not including, an infinite amount of time (since it's not finite, you can't really include it, short of making something up.) Granted, both instances are limited in the set of values that they can represent; however, in the case of the random, but piecewise (per character) finite (0-9A-F), any system which generated a sequence based off of a set of rules would be difficult to construct such that it would adjust in size as needed based off of the determination of the current length being inadequate. As you'd need to enumerate each variation of the 'random' but finite sequence to determine whether the set has been completed before increasing the size requirements.
I was more speaking hypothetically in the facetious sense.
Admin
Okay, but still, for both the random and the sequential case, there's no guarantee of a dupe until we've hit N+1 IDs. The only difference is that until that point, with the sequential we're guaranteed no dupes, and with the random, we have an ever-decreasing probability of no dupes.
And in the domain of a GUID, and assuming it has decent entropy, for any reasonable lifetime, the probability will never climb above "unimaginably minuscule."
What? You can't be facetious on the internet!
Admin
Why do the PINs need to be unique?
Admin
Admin
Admin
Admin
and what's the possibility that the "newly generated" PGUID ("processed GUID") is already not present in the system either as an PGUID or non processed GUID? they should at least think of that..
Admin
I get the academic ammusment of discussing "the chances" of the two first digits being the same or being born on the same day or day of week as someone else, but there's something I just want to point out. Those discussions all seem to assume that there is an equal chance of all outcomes. There can actually be properies inherent (or engineered) in the process of choosing that favors certain combinations. The GUID generation algorithm is essentially a black box for most of us. And for things like being born on the same weekday...doctors may not like to work late on Fridays, for example, so may induce more on that day. Statistics are like bikinis -- what they reveal is suggestive, but what they conceal is vital.
Admin
Nooo... your GUID-space would be heximated.
Admin
Once, pre MS GUIDs, I had to write a routine to generate unique strings which were printed and distributed to another branch of the company. For some reason, in the first few pages of strings, even though the strings were fairly short, some pretty serious profanity showed up embedded in the strings. So I had to add a filter.
This experience forever changed my views on randomness.
Admin
Admin
Admin
Nope, it's actually 1-in-16 since there's 16 'winning' combinations out of those 256 possible permutations.
Admin
I'm guessing it makes sense if you multiply the terms. Or use COMMON FUCKING SENSE.
Oh wait - nerd site - never mind.
Admin
Yes, I know. Not sure of the relevance or what point you're trying to make.
Admin
Exactly. And since we don't KNOW of any algorithm preference for or against the first two digits being the same, the best we can do is assume that they're random and independent, and hence the probability of a match is 1:16.
If we were to peek inside the black box and learn otherwise, we would of course change our calculations appropriately. Likewise, if we took a statistically significant sample and saw a meaningful trend in one direction or the other.
Far as I know though, all we have to go on at this point is that it's "best effort random".
Admin
Well, I was not questioning the source. The question was: if Alex submitted this to VS Mag so long ago, why did he publish the same WTF of this site just now?
Admin
yes ... so it's one out of sixteen: 00, 11, 22, ... ff = 16 combinations out of 256 = 1/16.
Admin
Could you please post your full name and a picture? We need it for one of the tables in our HR database.
Admin
What?? Absolutely not. There are 16 outcomes that we care about, out of 256 possible outcomes, so the probability of getting one of those we care about about is 16/256 = 1/16. Same number, but your reasoning gets to it by sheer coincidence. Say, if there were 17 chars instead of 16, but you still cared only about the 16 cases above, the probability of getting one of them would be 16/(17^2).
Admin
I know a lot of students who complain they will never need something as abstract as probability theory in the "real world".
Admin
Linux programmer here. Am I missing something?
Admin
Yes - Windows is what most people use.
Admin
Admin
So they can be used as usernames and passwords (can't be bothered to dig up the WTF where the authentication was effectively by password while the username was ignored, so if the admin rejected your password, you knew somebody else used it).
Admin
Where do you find these people? Seriously.
Admin
Nonsense !
Most programmers are dicky wannabees using and creating non-unique guid creation functions.
Many GUID functions have a logical implementation that can NOT guarantee unicity (yes I'm talking to you all you md5-sha-noobstylers ^^).
Of course Andy here is the leader of the pack, but still most libraries available for guid's aren't any more unique (lol let's base it on oh-so-unique MAC address and stuff...), even those coming from "respectable sources".
Worse than that, most of the fake guid solutions available are slower than the real guid ones --
Admin
Wow, you sure put us in our place, with your incoherent, unsupported hissy fit.
Admin
http://download.oracle.com/javase/1.5.0/docs/api/java/util/UUID.html
This is by design not unique. Is that supported enough for you ?
Admin
No GUID/UUID can possibly be guaranteed to be unique. If you have a specific failing to address with that class, and you can actually explain coherently why it's insufficient for its intended purpose, and what you think is better and why, please do.
So far all you've put forth is a rant and a sound-bite.
Admin
You sir are correct, i shan't be ranting anymore --
Basically, I find it useful to have UUID's that can at the very least be unique in a controlled environment.
As in, being able to have unique id's within a limited set of machines(I picked 32bits for the machine id, carefully distributed by me and intended to be unique, not based on MAC which can/must sometimes be spoofed for other purposes).
In order to achieve that, several parts are required:
-> machine unicity -> thread unicity -> thread time unicity -> inside-thread unicity
The reason I picked those variables is very easy : -> machine id is a preset variable -> thread id is a readily available variable (you can pre-store it @ start of thread) -> thread time unicity (same you can pre-set it @ start) to avoid mixing thread 1234 from 12 o'clock with thread 1234 from 16 o'clock (with the time id which is in milli or microseconds, you can be sure that ending your thread with a very small sleep will prevent any possibility of two threads on the same time point) -> inside thread unicity (counter... of course you always have at least a counter going somewhere for some reason, so might as well use it)
So basically, I picked 4 variables which are almost free to get (machine_id is present even before thread execution, thread_id is present @ thread creation, and thread_time is present @ thread creation too, the counter is almost inevitably present if you're handling several objects) in terms of processing power, but can also guarantee unicity.
This can be guaranteed to be unique, as long as your time is unique (and if it is not, you have so many problems beyond unique identification it's not worth focusing on it -- ), and from my calculations, that type of unique id is safe for the next 20+ years according to processing power growth (although at some point threads_ids might get another digit).
Besides, this method of generating unique id's is orders of magnitude faster, as it requires only very minor cpu activity (no random, no hash, no time variable generation on every id, etc. etc.) -> the most expensive operations have to be on the base16 encode / concat side.
Also, I chose Java on purpose as it's more or less respected as a language, but the implementations in PHP are just as much of a problem.
Admin
Interesting that you're so sure that your ID generator is so superior to those of all the rest of us "dicky wannabes." I too have had to create a unique-ish ID scheme, and it's more than up to the task set for it.
However, given the choice, I'll simply use java.util.UUID.randomUUID(). On my 6-year-old PC it can create over 800,000 IDs per second, which is just over one microsecond per ID. That's certainly "very minor CPU activity" in every context in which I've ever needed IDs. (Perhaps your ID-generating demands have been heavier.) And if you're worried about a collision, then you haven't done the math.
So in all honesty, I still fail to understand what point you're trying to make.
Is it that many developers make bogus assumptions when coding "unique" ID generators? Absolutely.
Is it that most developers are idiots but you're far superior? Not seen one shred of evidence for that yet.
Is it that most ID libraries suck? No support for that either.
Is it that java.util.UUID in general and/or UUID.randomUUID() in particular suck? No evidence of that either,. Of course, that doesn't mean there aren't domains where they're not appropriate. But so what? No tool is a panacea.
Admin
That's highly specious.
Getting the same instruction to be executed within the same milli- or microsecond on more than one box is not that far fetched, especially when you take into account server farms, time servers, cron jobs, and the like. And even with the sleep() call, while less likely still, they're certainly not "guaranteed" not to collide.
Now, it may be that your particular context minimizes these issues to the point where the risk is acceptable, which is fine. Just don't present the approach as if it's "guaranteed" and will work for the general case.
Admin
My point is : The default libraries for UUID do NOT provide UNIQUE ID's and THEY ARE SLOW.
There is NO reason to accept inefficient AND ineffective code when given a better alternative.
Admin
And thus I suggest you re-read my post, which talks about a unique machine ID, NOT based on MAC address (which you sometimes can /have to spoof) but still unique by design (i.e. unique key distribution service).
The sleep call is ONLY there to prevent the following case :
microsecond 0,0: thread fired, thread id set, thread time set, thread count set, action completed, thread killed; microsecond 0,3: thread fired, thread id set (o shit it's the same), etc.
So instead you have this
microsecond 0,0: thread fired, thread id set, thread time set, thread count set, action completed, sleep(1*time granularity), thread killed;
Which implies that even if your computing power was infinite, you'd still be in the next microsecond, thus guaranteeing that the time part is unique.
Admin
In a case with an optimized java software (lol like that even exists) on very powerful hardware, the numbers you think about DO NOT apply.
In the same case, on hardware 10 or 20 years from now, you obviously have no ideas of how "possible" a collision becomes.
So on one hand, you have a solution that guarantees that a collision is IMPOSSIBLE, and it runs faster.
On the other hand, you have a solution that guarantees that p(collision) is QUITE SMALL, and it runs slower.
AND YOU WANT TO PICK THE SECOND ONE BECAUSE ?
i'm not even stating that my prototype uuid is a panacea, i'm just telling you it's a zillion times better than a solution with a non-null p(collision).
If it were my decision, I'd drop the RFC in the trash can, have a retarded machine_id ready for the future (let's give it 64bits), the smallest time fraction available in standard software (nanotime ?), a thread count over 16bits, etc. just to make sure that solution is still valid in 2100.
Admin
Irrelevant. Once it's fast enough, making it go faster adds no value. And it's already way more than fast enough.
I justify it by that fact that since making it faster than fast enough adds no value, then the contrapositive follows--namely, NOT making it faster than fast enough does not diminish the value.
Your library does not provide unique IDs either. java.util.UUID.randomUUID() provides unique values WITH EXTREMELY HIGH PROBABILITY--certainly no worse than your scheme.
And not, it is NOT slow. "slower than the fastest possible" != "slow"
Agree completely. But since java's UUID is more "efficient" than I'll ever need, and plenty effective, neither your code nor mine is a "better alternative", except that mine was better in the situation where I used it because I UUID was simply not an option--for reasons that had nothing to do with its speed or effectiveness.
Admin
Irrelevant. Your claim was that times should never match, and if they do, you have a huge problem. Unique machine IDs do not factor into the fact that that claim is dubious.
They DO factor into making reducing the likelihood of collision of the ID in toto, even if times do match, yes, but I wasn't refuting that, only your claim about time on it own.
So I suggest you re-read my post.
Except that sleep() can wake up spuriously, so there's no guarantee that time has elapsed as far as your code can tell.
Or the system clock could have been adjusted.
So, again, you have no guarantee of unique times.
Admin
Up until now I'd had my doubts about whether you were actually capable of a reasonable discussion, but I was willing to give you a chance. Now I can see you're just a nutter.
Wrong.
Irrelevant.
Buh-bye. Do post again if you decide to live in the real world someday though!