• bkr (unregistered)

    That's a very, very poor story; it just says that you should learn the basics of server administration and some math before trying to make big $.

  • Gumpy Guss (unregistered)

    There's the old story of when Wernher von Braun's team needed to ensure that the Apollo rockets were safe and the spec was for "five nines" (99.999% success rate). He asked his top engineers if this was possible and he did get five "Neins".

  • 99,999% (unregistered)

    This comment is 99,9% first.

  • Dan (unregistered)

    Imagine my heart palpitations when salespeople were flanking me (as CTO) in big-dollar meetings, discussing "100% secure".

    I don't think I shat right for a year.

  • (cs) in reply to Dan
    Dan:
    I don't think I shat right for a year.
    Thanks for the visual.
  • (cs)

    We have 99.9% uptime SLAs on certain systems, but there are hot live clustered servers, auto failover, duplicated everything in DR, and so forth.

    How could Gary a) give such an SLA on a single box system, and b) send a part-timer out to fix an essentially dead server without explicitly mentioning the SLA (not that Ivey really had much choice at that point)?

    I hope MediumCo sues Gary's a** out of business!

  • nitehawk (unregistered) in reply to Gumpy Guss
    Gumpy Guss:
    There's the old story of when Wernher von Braun's team needed to ensure that the Apollo rockets were safe and the spec was for "five nines" (99.999% success rate). He asked his top engineers if this was possible and he did get five "Neins".

    Except that the rocket was the Saturn V, and that it did indeed have a 100% success rate. I believe it is the only major rocket in history to achieve this.

    The Apollo capsule did have a couple of spectacular failures, however.

  • Anonymous (unregistered)

    Poor Gary, IT just isn't the industry for him.

  • (cs)

    And yet again, another crooked snake oil salesman who knows nothing trying to pretend he's a business owner, but not spend money to ensure a proper environment. The words "cobbled together workstation running Linux" earlier in the story say it all.

    I hope this jackass got sued into oblivion; he deserves it. Not that he didn't TRY by having a real server, but a server without proper support does nothing.

  • Ian Tester (unregistered)

    The Real WTF is that Ivey is spelt "Iven" one time in this story.

    Oh, and who uses VNC to remotely operate a Linux server? What's wrong with good ol' SSH?

  • (cs)

    See, if your customers are dyslexic, it's much easier to sell them on a nine-fives SLA (55.5555555%! amazing!) than a five-nines SLA... easier to uphold, too.

  • Georgem (unregistered)

    So what "immediate resolution" did the VP of MediumCo expect, exactly? Since time travel is either not possible, or not readily available, the only resolution open here is to retroactively lower their uptime expectations

  • wiregoat (unregistered)

    Were you expecting 99.99% or 9.9% always funny WTF worthy stories?

  • (cs)
    Connected via VNC, Ivey immediately discovered that there were a bunch of connections that weren't closing as expected. The automatic updates were fine, except for a queue of kernel updates that were downloaded but not installed. This was expected, as those required a reboot, but it was clear that they were the source of the TCP/IP stack errors. From the comfort of his computer chair in his den, he rebooted the machine... only the machine didn't come back online.
    1. In recent versions of the kernel, not every kernel update needs a restart (but OK, they might have used an old stable release). Still, kernel updates were probably needed, as such machines tend to get security updates only (or am I wrong in this case?).
    2. Why reboot the machine? I'm not a wizard, but I'd restart the automatic updates. It's not hard to find out which process it is.
    3. I'd like to know the distro, to avoid it.
    4. Well, why update grub at all? Provided it has no security issues, it should be left intact. That way, if you have problems with the new kernel, you fall back to the old one. It's not so hard. I'm doing it on my non-production computer.
  • katre (unregistered) in reply to Georgem

    When business types ask for "immediate resolution", what they really mean is "I want my money back. And some extra money, besides. Hmm, how's your kidney? I want that, too."

  • Wayne (unregistered) in reply to Ian Tester

    That, plus this sentence:

    "But after been fighting a long, uphill battle against some larger competitors"...

  • (cs)

    Looks like they had four-nines uptime for sure. Except when I went to school I learned that four nines are thirty-six…

  • Michael (unregistered) in reply to Gumpy Guss
    Gumpy Guss:
    There's the old story of when Wernher von Braun's team needed to ensure that the Apollo rockets were safe and the spec was for "five nines" (99.999% success rate). He asked his top engineers if this was possible and he did get five "Neins".

    This comment wins the thread, laughing hard at the five "Neins".

  • morry (unregistered)

    as someone who works on a 99.9% uptime system, with SLAs, let me tell you how it really works.

    #1) it's 99.9% SCHEDULED uptime. we have regularly scheduled downtimes for maint.
    #2) we have at least 2 other different systems to catch downstates if the main system goes offline. By different I mean they do the basic function, but not anywhere near as in-depth as the main.
    #3) SLAs are graded. you miss 99.9% and hit 99.8%, it's a minor infraction. You hit 80% and you're in trouble.

  • Bri (unregistered)

    There's always the classic response. "You're right, you had 2% downtime this month. Here's a 2% refund."

  • Seth Rightmer (unregistered)

    TRWTF is definitely Ivey's idiotic automatic patch policies on a production server. Who automatically updates a server without testing the patches first? And kernel patches? Really now. Plus, no remote console at a site that's a two hour drive away?

    The boss was just being a normal, clueless boss, the IT guy screwed up.

  • (cs)

    I always love the obfuscated company names! You must be geniuses to come up with that stuff!

  • (cs) in reply to Seth Rightmer
    TRWTF is definitely Ivey's idiotic automatic patch policies on a production server. Who automatically updates a server without testing the patches first? And kernel patches? Really now. Plus, no remote console at a site that's a two hour drive away?

    The boss was just being a normal, clueless boss, the IT guy screwed up.

    I thought about remote console but well - if the offsite backup == basement he might have funds. Anyway - AFAIK there are still 2.2.x servers - they just work. No need to update, as you said.

  • Jay (unregistered)

    Seems to me the real question is, What remedies does the contract provide for failure to meet the 99.99% uptime requirement? If it doesn't say that in that case they'll give you a month's free service or whatever, I wonder what, legally, the other party can do. Like if I buy a car that's advertised to get 35 miles per gallon, and I find that in fact I only get 34 miles per gallon, frankly I doubt I'd get anywhere in a lawsuit. Maybe I could demand my money back. I'd be surprised if a judge awarded me big bucks in a lawsuit over such a thing. Of course, judges do some crazy things, like giving people millions of dollars because they burned themselves spilling coffee in their own lap.

  • I note that it didn't say "99.99% per year" (unregistered)

    StruggleCo might be smarter than they look. So they guaranteed 99.99% uptime... but I don't see that they guaranteed it on a per-year basis. As long as it eventually averages out, they should be fine. :P

  • Jay (unregistered)

    This does remind me of a former employer who signed a contract with a big customer to sell them one of our software packages, with a clause in the contract saying that we would make any change to the software that they requested, at any time, for no additional charge.

    I pointed out to the boss that this was rather open-ended. They could demand changes that would require thousands of hours of programmer time. He replied that we were getting several hundred thousand dollars for this contract, so it was worth it. I said that if we got $300,000 but had to do $400,000 worth of work, we weren't going to make money. He looked at me like I was an idiot and asked if I REALLY thought that we should pass up a several-hundred-thousand-dollar contract. We circled around on this a few times until we both walked away convinced the other person was nuts.

    That company is bankrupt now. I can't imagine why.

  • CoyneT (unregistered) in reply to snoofle
    snoofle:
    How could Gary [...] send a part-timer out to fix an essentially dead server without explicitly mentioning the SLA (not that Ivey really had much choice at that point)?

    That's not even the best part: The SLA was violated (by over 100%) by the time Ivey drove to the data center (which was 2 hours away).

    No doubt, "It's Ivey's fault," because he didn't buy (and use) a Learjet on his part time wage.
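    A rough check of the "over 100%" figure, as a sketch only. It assumes the 99.99% is measured over a 365-day year, which the thread itself points out the contract may not have specified; the function name is made up for illustration.

```python
# Downtime allowed by an uptime SLA, compared with a two-hour drive.
# Assumption: the SLA window is one 365-day year (not stated in the contract).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(sla_percent, window_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted in the window at the given SLA."""
    return window_minutes * (100 - sla_percent) / 100


budget = downtime_budget_minutes(99.99)   # roughly 52.6 minutes per year
drive = 2 * 60                            # the two-hour drive, in minutes
overrun_percent = (drive - budget) / budget * 100

print(f"yearly budget: {budget:.2f} min")
print(f"drive alone exceeds the budget by {overrun_percent:.0f}%")
```

    Under that assumption the drive alone uses more than twice the yearly allowance, before any repair work starts.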

  • (cs) in reply to nitehawk

    With 13 launches (according to the Wiki), how would you know the difference between a 90% success rate and 100%? Or even 80%, for that matter.
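    To put a number on that question, here is a small sketch. It assumes each launch is an independent trial with the same success probability, which is a simplification; the function name is hypothetical.

```python
# Chance of seeing 13 successes out of 13 launches for several
# hypothetical "true" per-launch success rates.
# Assumption: launches are independent and equally reliable.
def p_all_succeed(rate, launches=13):
    """Probability that every launch succeeds."""
    return rate ** launches


for rate in (0.80, 0.90, 0.95, 0.99999):
    print(f"true rate {rate:.5f}: P(13 of 13) = {p_all_succeed(rate):.3f}")
```

    Even a rocket that fails one time in ten would show a clean 13-for-13 record about a quarter of the time, so a perfect record this short cannot separate 90% from 100%.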

  • djeidot (unregistered)

    Theoretically, if you'd waited five or six years and had absolutely no downtime during this time, the 99,99% uptime percentage could still be accomplished...

  • Space Cowboy (unregistered) in reply to nitehawk
    nitehawk:
    Gumpy Guss:
    There's the old story of when Wernher von Braun's team needed to ensure that the Apollo rockets were safe and the spec was for "five nines" (99.999% success rate). He asked his top engineers if this was possible and he did get five "Neins".

    Except that the rocket was the Saturn V, and that it did indeed have a 100% success rate. I believe it is the only major rocket in history to achieve this.

    The Apollo capsule did have a couple of spectacular failures, however.

    If you consider Apollo 6 that nearly destroyed itself due to pogo oscillations and Apollo 13 that had a center engine cut-off on the second stage (due to similar issues) and several other issues "100% success".

    Truth is, if you fly something only 13 times, you're likely to beat the odds.

    Remember, the shuttle flew more successful flights up until Challenger than the Saturn V ever flew.

    Don't cherry pick data. Otherwise you can have 100% uptime, until you're down.

    One metric that I heard did come from the Apollo program was that every 9 doubled costs.

    Want 90% reliability? It costs X. 99% costs 2X, 99.9% costs 4X, etc. And as a first metric I've found that very reasonable when it comes to datacenters and the like.
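    Written out, the rule of thumb is just a power of two. This is only a sketch of the heuristic as stated, not a real cost model, and the function name is invented.

```python
# "Every nine doubles the cost": 90% costs X, 99% costs 2X, 99.9% costs 4X.
# Returns the multiplier on X for a given number of nines.
def relative_cost(nines):
    """Cost multiplier for `nines` nines of reliability (1 nine = 90%)."""
    if nines < 1:
        raise ValueError("the rule of thumb starts at one nine (90%)")
    return 2 ** (nines - 1)


for nines in range(1, 6):
    reliability = 100 - 100 / 10 ** nines
    print(f"{reliability:.3f}% -> {relative_cost(nines)}X")
```

    By this rule, five nines costs sixteen times what one nine does.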

  • (cs) in reply to Space Cowboy
    Space Cowboy:
    One metric that I heard did come from the Apollo program was that every 9 doubled costs.

    Want 90% reliability? It costs X. 99% costs 2X, 99.9% costs 4X, etc. And as a first metric I've found that very reasonable when it comes to datacenters and the like.

    What if you want 9% reliability?
  • (cs) in reply to I note that it didn't say "99.99% per year"
    I note that it didn't say "99.99% per year":
    StruggleCo might be smarter than they look. So they guaranteed 99.99% uptime... but I don't see that they guaranteed it on a per-year basis. As long as it eventually averages out, they should be fine. :P
    That's what I thought. As there's no limit stated, every hour of downtime means that MediumCo will stay with them for 416 days and 15 hours more. If MediumCo someday decides not to keep their service, then it's not StruggleCo's fault that they won't provide the 200 years of uptime that are due.

    And hooray to the return of featured comments
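    The 416 days and 15 hours figure checks out. A short sketch of the arithmetic, using exact fractions to avoid rounding; the function name is hypothetical.

```python
# How much uptime must accumulate to "pay off" downtime at a given SLA?
# From uptime / (uptime + downtime) >= sla it follows that
# uptime >= downtime * sla / (1 - sla).
from fractions import Fraction


def uptime_hours_owed(downtime_hours, sla=Fraction(9999, 10000)):
    """Hours of uninterrupted uptime needed to average back to the SLA."""
    return downtime_hours * sla / (1 - sla)


hours = uptime_hours_owed(1)      # 9999 hours per hour of downtime
days, rem = divmod(hours, 24)     # 416 days, 15 hours
print(f"{hours} hours = {days} days and {rem} hours")
```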

  • Gumpy Guss (unregistered) in reply to nitehawk

    The only way to prove a 99.999% success rate is to send up 100,000 rockets and get only one failure.

    The Apollo launch record only suggests a 95% or better record. A long way from 99.999%.

    Actually there were a couple serious problems that led to one engine shutdown on an early flight. Just no spectacular explosions.

  • Alin (unregistered) in reply to Bappi
    Bappi:
    Space Cowboy:
    One metric that I heard did come from the Apollo program was that every 9 doubled costs.

    Want 90% reliability? It costs X. 99% costs 2X, 99.9% costs 4X, etc. And as a first metric I've found that very reasonable when it comes to datacenters and the like.

    What if you want 9% reliability?

    Then it will be FREE because you would probably be doing all the testing for the product and people would get a good laugh of you trying to use it.

  • The Stainless Steel Hankie Of Justice (unregistered) in reply to Jay
    Jay:
    Of course, judges do some crazy things, like giving people millions of dollars because they were hospitalized for eight days with third degree burns for spilling coffee that would cause a full thickness burn to human skin in two to seven seconds in their own lap.

    FTFY

    http://www.lectlaw.com/files/cur78.htm

    P.S. How bad is a third degree burn? Here's a hint: there's no such thing as a fourth degree burn.

  • (cs) in reply to Bappi
    Bappi:
    Space Cowboy:
    One metric that I heard did come from the Apollo program was that every 9 doubled costs.

    Want 90% reliability? It costs X. 99% costs 2X, 99.9% costs 4X, etc. And as a first metric I've found that very reasonable when it comes to datacenters and the like.

    What if you want 9% reliability?

    I'm going to flex my math muscles and say it's 1/2X.

  • (cs)

    If only he had bought two servers - as we all know* 100% is just 50% twice... :(

    * for various values of "we all" not actually including me

    np: Underworld - Dirty Epic (DubNoBassWithMyHeadMan)

  • Jon E. (unregistered)

    Not even Viagra claims that much uptime.

  • (cs) in reply to The Stainless Steel Hankie Of Justice
    The Stainless Steel Hankie Of Justice:
    there's no such thing as a fourth degree burn.

    not to be pedantic (ok ok to be pedantic) but yes there is (note: google image results are not what you want :-)

  • Anon (unregistered) in reply to Bappi

    That's what, Twitter?

  • (cs)

    If you promise 4 9's, you have to have a failover machine. You HAVE TO HAVE a failover machine. Setting up failover isn't hard, and it isn't that expensive, unless you're promising a level of throughput. Put the disks in an external enclosure, hook two machines up to the enclosure, and set the system to automatically route to the second machine if the first one stops responding.

    You also need to do occasional maintenance. It was auto-patching? Seriously? And this was considered to be a good idea? You're betting your contract on a patch installing automatically and not breaking anything? And you're auto-patching the fricking KERNEL?! They'd have been much better off not patching at all.

    Frankly the WTF is that these people thought they could get away with offering 4 9's in a contract and not putting any real money or effort into it. If some company NEEDS 99.99% uptime, they're going to notice when they don't get it.
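    The case for a failover machine can be put in numbers. This is a minimal sketch under strong assumptions: the machines fail independently, failover is instant and always works, and nothing is shared between them. The shared disk enclosure described above would itself be a single point of failure that this model ignores. The function name is made up.

```python
# Availability of N redundant machines where the service is up as long
# as at least one machine is up.
# Assumptions: independent failures, perfect instant failover,
# no shared components.
def parallel_availability(single, machines=2):
    """Combined availability of `machines` independent redundant boxes."""
    return 1 - (1 - single) ** machines


for single in (0.5, 0.99, 0.999):
    combined = parallel_availability(single)
    print(f"one box {single:.3%} -> two boxes {combined:.4%}")
```

    Under those assumptions two 99% machines reach four nines between them, which a single box cannot plausibly do on its own.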

  • (cs) in reply to Anon
    Anon:
    That's what, Twitter?

    Yea right. Twitter, as a service, is hanging on to 2 9's uptime by the skin of its teeth.

  • (cs) in reply to akatherder
    akatherder:
    Bappi:
    Space Cowboy:
    One metric that I heard did come from the Apollo program was that every 9 doubled costs.

    Want 90% reliability? It costs X. 99% costs 2X, 99.9% costs 4X, etc. And as a first metric I've found that very reasonable when it comes to datacenters and the like.

    What if you want 9% reliability?

    I'm going to flex my math muscles and say it's 1/2X.

    But every nine doubles costs. Since 90% has the same number of nines as 9%, you're saying that X = 1/2 X.

    And 9% reliability isn't impossible. Imagine you have a process that is extremely profitable, but unreliable. If you have only, say, 7% reliability, a small increase to 9% may be very very good for the bottom line.

  • Matt S (unregistered)

    Technically, he's lucky he didn't guarantee 99.(999)% uptime, because that's technically 100%.

    http://en.wikipedia.org/wiki/0.999...

  • Matt (unregistered)

    When do we talk about global warming?

  • (cs) in reply to Jon E.
    Jon E.:
    Not even Viagra claims that much uptime.
    Consult your doctor if you experience an erection accounting for more than 99% uptime...
  • (cs)

    HP NonStop!

  • Matt (unregistered) in reply to Michael
    Michael:
    Gumpy Guss:
    There's the old story of when Wernher von Braun's team needed to ensure that the Apollo rockets were safe and the spec was for "five nines" (99.999% success rate). He asked his top engineers if this was possible and he did get five "Neins".

    This comment wins the thread, laughing hard at the five "Neins".

    Subtlety is not lost on you.

  • (cs)

    Six nines is never down!

    9x9 + 9+9 + 9/9 = 100

  • Matt (unregistered) in reply to Jay
    Jay:
    Seems to me the real question is, What remedies does the contract provide for failure to meet the 99.99% uptime requirement? If it doesn't say that in that case they'll give you a month's free service or whatever, I wonder what, legally, the other party can do. Like if I buy a car that's advertised to get 35 miles per gallon, and I find that in fact I only get 34 miles per gallon, frankly I doubt I'd get anywhere in a lawsuit. Maybe I could demand my money back. I'd be surprised if a judge awarded me big bucks in a lawsuit over such a thing. Of course, judges do some crazy things, like giving people millions of dollars because they burned themselves spilling coffee in their own lap.

    That initial amount was later reduced to a few tens of thousands or a few hundreds of thousands. FYI.

Leave a comment on “Infinity Nines of Uptime”
