• (cs) in reply to mikko
    This has happened to us in less dramatic ways with regular Linux/Intel servers when the uptimes were 'too long'. We've had 400+ days uptimes, during which a number of new services were put on and configured on some systems. Along comes an hardware failure or an electrical blackout compounded by some UPS problem, the machine reboots and does not come up quite the way it was supposed to. Someone wrote an init script for some service or another and never tested rebooting on the actual system (because it was in production), with its dependencies and other services.

    Our work is mostly not that time-critical; we can afford an hour or two of downtime to debug the boot process. Problems like this just show that uptime should not be an end in itself. Not that it ever was for us; we just didn't bother rebooting. Rebooting every now and then is good for you, if not for getting a new kernel in, then at least for checking that rebooting works.

    OK, I'm going to summarize this as succinctly as possible.

    Uptime doesn't matter. Downtime does.

    Which would you rather have: a machine that stays up for forty days and reboots reliably in thirty seconds (five nines), or a machine that stays up for four hundred days and reboots, with 95% reliability, in four hours (somewhat fewer nines)?

    The latter might be impressive; but in, ahem, mission-critical systems, it's a bit of a bust (rough arithmetic below).

    It's all about downtime.
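
    A quick back-of-the-envelope check of those numbers, sketched in Python (treating the 95%-reliable reboot optimistically, i.e. counting only the four-hour reboot itself as downtime):

        # Availability per reboot cycle = uptime / (uptime + downtime).
        def availability(uptime_s, downtime_s):
            return uptime_s / (uptime_s + downtime_s)

        DAY = 86400  # seconds

        a = availability(40 * DAY, 30)          # up 40 days, 30-second reboot
        b = availability(400 * DAY, 4 * 3600)   # up 400 days, 4-hour reboot

        print(f"A: {a:.5%}")  # ~99.99913%, i.e. five nines
        print(f"B: {b:.5%}")  # ~99.95835%, i.e. roughly three nines

    On those assumptions, the forty-day machine really does clear five nines, while the four-hundred-day machine lands around three.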

  • Grig (unregistered)

    Back when I worked for AOL, we had some old, old, old, old Netscape equipment with uptimes of 5 and 7 years respectively. They were old HPUX machines, and were critical for some Netscape 4.5-related issues (I forget the details, but even when Netscape 7.0 came out, we still had over 15,000 Netscape 4.5 users). I don't know what would have happened if these machines had died, but I was told "it would be bad." I managed them as part of a "give the new guy the stuff we don't want" collection (which also included DMOZ). There was no documentation and I didn't have root access. The funny thing is NOBODY had root access to any box! As part of their "multi-turnkey" security, the only way to get root on a box was a set 15-minute window through an out-of-band (OOB) channel. The root passwords were scrambled by an automatic process every 15 minutes, and you had to jump through hoops to get whatever the scrambled password was during some 15-minute window.

    So they were unpatched, running several old SCSI disks. No backups. The consensus was that if we ever shut them down, the disks would not spin up again.

    That was back in 2005. I wonder if they are still going?

  • Steve Nuchia (unregistered) in reply to Kuba

    [quote user="Kuba"][quote user="Martin"] This is akin to labeling the "OFF" switch with an "ON" label and blaming the user for not looking in the manual. [/quote]

    When office equipment started coming out with "world" rather than "US" markings, I went to do something with a PC and there was the 0/1 switch... except they used stylized graphics, so it was really a circle/line switch. Which is which? Open eye = on, closed eye = off was my initial reading. Wrong! Progress in ergonomics: cross-culturally counterintuitive markings on critical operating controls.

  • cmb (unregistered) in reply to FlyboyFred

    A clbuttic mistake.
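
    (For anyone who missed the joke: "clbuttic" is what a naive profanity filter produces when it does blind substring replacement. A minimal Python sketch, with a hypothetical one-entry filter table:)

        # Naive word filter: blind substring replacement, no word-boundary check.
        REPLACEMENTS = {"ass": "butt"}  # hypothetical filter table

        def naive_filter(text):
            for bad, clean in REPLACEMENTS.items():
                text = text.replace(bad, clean)
            return text

        print(naive_filter("a classic mistake"))  # -> "a clbuttic mistake"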

  • seasoned (unregistered) in reply to nisl

    Actually, you can NEVER really "test M/S Windows scripts". DLLs can change at any moment and the whole system could change. So Windows has all the problems that they said the Tandem has AND THEN SOME! And most computers measure uptime in YEARS! UNIX systems do, I understand mainframes do, etc... Disks last at LEAST an average of 3 years, although I have had many last over 10! So it is WINDOWS that gets the blame for crashes!!!!!

  • alonzo meatman (unregistered)

    The REAL WTF is that Chris's company did business with apartheid-era South Africa.

  • fufu (unregistered) in reply to nisl

    Flipping the power switch on that Tandem for a few seconds would not require a complete restart. Tandem machines (still) do a power-fail restart and continue where they were interrupted. Thus, after power on, that machine, its database and application very likely were in perfect shape, without the need for a "Cold-Load" (or IPL, or Ctrl-Alt-Del, reboot); or a need to run the broken scripts. It may have lost its network connections to the ATMs (if that was based on tcp/ip), but otherwise no damage done.

  • My2Cents (unregistered)

    Old thread, found in a search result. Any extreme-reliability design would take PERSISTENT storage into account. Storing code, procedures, and values in volatile RAM (no matter how many standby CPUs are present) is NOT persistent, period. It doesn't matter how many UPS backup units you have: batteries fail, wiring fails, circuit breakers fail, power contacts fail, etc. I've written over a million lines of code in my lifetime and I can tell you that potential power interruption, at ALL levels, must be considered as part of design reliability. The operating system, software, and other scripts must all rely on PERSISTENT redundant storage.
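
    The point about persistence can be made concrete with the ordinary write-then-fsync-then-rename pattern; a minimal Python sketch (the file name and values are made up, and this is not the Tandem mechanism from the article):

        # Durable, atomic save of configuration/state: write a temp file, flush
        # it to stable storage, then atomically replace the old copy. A power
        # cut at any point leaves either the old file or the new one, never a
        # half-written mix -- and nothing lives only in volatile RAM.
        import json, os, tempfile

        STATE_FILE = "startup-config.json"  # hypothetical file name

        def save_state(state):
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(STATE_FILE) or ".")
            try:
                with os.fdopen(fd, "w") as f:
                    json.dump(state, f)
                    f.flush()
                    os.fsync(f.fileno())      # force data to disk
                os.replace(tmp, STATE_FILE)   # atomic rename
            finally:
                if os.path.exists(tmp):
                    os.remove(tmp)

        save_state({"services": ["atm-gateway"], "version": 42})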
