• Biri (unregistered)

    Mmmm, printed memes...

  • Quite (unregistered)

    Shrug. Just one of the things that happens when you just type commands into a box rather than implement a script which bags up all these commands into a nice little automated tool.

  • lurker (unregistered)

    Dvorak to the rescue!

  • ChrisH (unregistered)

    And this, Ladies and Gentlemen, is why you virtualize every single fucking box, even if it only ever does one job.

  • Meh, shit happens. (unregistered)

    He didn't even drop the company's main server to runlevel 1? Just the server he was supposed to reconfigure? A minor slip-up and the 10 minute rant was unwarranted.

    If you hire colo servers then a call to the datacenter to restart one is just a routine thing (or at least it should be). If the datacenter then needs to search for your server to manually reboot it, maybe it's time to look for another datacenter to rent your servers from. Come on, that's their main job.

  • Vietcongster (unregistered) in reply to Meh, shit happens.

    Looks like it's a server run by the company itself, not some datacenter, since Sam will be collecting his reward. That should explain (but not justify) that Piece-O'-WTF.

  • Dude (unregistered)

    This feels like it should have been a lot worse.

    • Reconfiguring brand new machines
    • Modular components that can be hosted on any machine
    • Off-hours maintenance for a (seemingly) non-critical system that isn't (according to the story) necessary for at least another week or two

    Even if he wasn't able to get the machine restarted, even if they couldn't provision a new machine... you can just put the missing component(s) on a different (unbroken) machine. You would again have to reconfigure everything (which could cause the issue to happen again), but beyond that I see no real WTF here, unless I'm really missing something.

  • OzPeyter (unregistered)

    Taking down a single server? Mere amateurs. I once saw a colleague take down an entire paper pulp mill with a misplaced keystroke.

    Pulp mills run on steam, lots of steam. And industrial-sized steam boilers have to be treated very carefully, as they can go boom if mistreated. And even if they don't go boom, taking a boiler off-line loses you a good half day in production... so big $$$. As such, the basic control systems of this particular boiler were running on a dual-redundant processor Network 90 system from Bailey Controls. The 2 processors sat side by side in a rack, with one being the Master and the other being the hot Stand-by, so that if the Master died the Stand-by would take over. However, there were various other safety devices ensuring that if everything went tits up there wouldn't be a boom.

    The basic methodology for updating the control software on one of these systems was to take down the Stand-by CPU, update its software and then restart it. Then you flip the configuration over so that the current Master becomes the Stand-by, and you repeat the update process on that CPU. This was all done manually via a command line.

    Remember how I said the 2 CPUs were side by side in the same rack? Well, you addressed the CPUs by their slot number in that rack. So if slot 4 contained the current Master CPU and slot 5 the current Stand-by, then you would issue a command like "shutdown 5" to shut down the Stand-by. Except that my colleague issued "shutdown 4" by mistake. We were in an office about 500 yards from the plant, and the second after he hit enter we could hear the boiler coming down to a crash stop and then lots of silence. Followed by an "oh shit".

    Later on I had an argument with the Network 90 rep. My argument was that if the Master CPU goes down for any reason at all then the Stand-by should take over; after all, it was a dual-redundant system for a reason. His counter-argument was "but you commanded the Master to shut down!" I still believe that he was an idiot.

  • Bert (unregistered)

    TRWTF: using sudo root. It's like taking the blades out of your razor for a closer shave. Anyway, that was one server out of twenty, was it really so bad?

  • Nicholas "LB" Braden (github)

    Yet another reason I don't like short arguments.

  • BugsBunnySan (unregistered) in reply to Bert

    "sudo root" isn't so bad as long as you treat it as what it is: you just cut open the patient's chest and are now operating on their living, beating heart with an extremely sharp scalpel. Maybe not the time to start daydreaming. You can do that after you've sewn them up again and they're still alive, i.e. after you did what you came to do and have dropped root privileges...

  • t0pC0der (unregistered)

    The real wtf is that I read a whole article that could be summed up as TL;DR- Pressed 1 instead of Q. Should have used deployment script

  • Stephen (unregistered)

    I don't buy it. First, data centers are built with management networks that give remote keyboard and video access to all the servers' consoles, remote power control and remote reboot. (Google "IPMI".) Second, if finding a server is all that difficult, how do they fix hardware failures? Third, this is a single box, not even in production yet, in a fault-tolerant system. Losing it for a day or two is not a disaster.

    Either it's fiction or the WTF is that Tori needs anger management classes.

  • Quite (unregistered) in reply to Stephen

    Probably went: "Ah, you fool. Looks like we'll have to get in touch with our guys in Salt Lake City. Ne'mind, I can handle it - more careful next time, mmm-kay? Or, hey here's an idea, how 'bout a deployment script?"

  • Griwes (unregistered) in reply to OzPeyter

    Yes, that guy was an idiot. Standby taking over no matter what happens is the entire goddamned point of having a redundant machine running as standby...

  • Anonymous') OR 1=1; DROP TABLE wtf; -- (unregistered)

    "sudo root" isn't even a valid command (there's no root(1) command). Should be "sudo -u root" or "su root".

  • z80 (unregistered)

    As I read the comments, it's clear that OzPeyter's story is much more WTF material than the article. Perhaps the censored part regarding Tori's behavioral issues could have justified a WTF...

  • EatenByAGrue (unregistered)

    It's a good thing the editor remembered proper decorum here. After all, this is a grown-up place, and we wouldn't want to endure the wrath of Bert Glanstron.

  • tim (unregistered)

    "sudo -u root" is the wrong thing as well: "-u root" is the default. You're thinking of "-s" or "-i". This makes the whole story look like a fake.

  • Wheel (unregistered)

    Sheesh. I was expecting some actual drama.

    Try doing the same thing to the production servers in Korea. Where the datacenter SLA doesn't actually have on-site personnel during their off-shift. Which was daytime back in the USA.

    Sure, Ashton didn't really dodge a bullet. But it's only a flesh wound.

  • bryan986 (nodebb) in reply to ChrisH

    That or have a lights-out-management capability.

  • Mikey Dread (unregistered)

    Bit of a lame story to be fair. I'm only let down as I normally like Jane's best.

  • Sam (unregistered)

    The story would have been better if it was Sunday in Salt Lake City, or Saturday in Israel, or the entire month of August in France.

  • Quite (unregistered) in reply to Sam

    ... or Bank Holiday Monday in the UK.

  • David Mårtensson (unregistered) in reply to Stephen

    In this case I took it that they had their own servers located in the data center, not rented equipment. In that case it's not common practice, as the data center usually does not have the manpower to actively manage customer equipment, nor the training for the different hardware and software.

  • JC (unregistered)

    I just came here to say:

    My mother once told me of a place With waterfalls and unicorns flying Where there was no suffering, no pain Where there was laughter instead of dying

    I always thought she'd made it up To comfort me in times of pain But now I know that place is real Now I know its name

    Sal Tlay Ka Siti

  • I dunno LOL ¯\(°_o)/¯ (unregistered) in reply to Bert

    If he had done "sudo telinit 1" it would have been the same result. I know that sudo can limit which commands you can run, but can it restrict parameters too? And in any case, these were NEW servers, and thus not (yet) subject to the kind of restrictions you would see on a live production server.

    Indeed, TRRWTF was not scripting this shit. Even five servers would have been enough to make it worth it just in time savings, but not having an automated script for twenty was just stupidly wasteful, even without mistakes.

  • Paul M (unregistered)

    And this, folks, is why you have out-of-band management consoles on your servers. Even HP iLO without the full licence key can do serial console, and you can use IPMI-style management if you hate HP's iLOs. Dell DRAC too, and of course Supermicro.

  • Herby (unregistered)

    Some machines have a network connection (separate from the others) that does "front panel" stuff. If you have one of these machines, it makes for a much nicer resolution of the "oops" problems. Otherwise, you need a many mile long screwdriver.

    On a system I designed (back in the late '70s), I had a modem connection that could take control from the console terminal, complete with 'reset' capability. I didn't use it too often, but it was a nice feature. It got really interesting when the remote device echoed characters and the connection was through a satellite link. The delay was around three characters of typing.

  • Tsaukpaetra (nodebb)

    Doing things by hand? Not having remote OOB management tools for servers? What, are they still sending messages via carrier pigeon?

  • Pedantic Jerk (unregistered) in reply to OzPeyter

    So, you take the CPU down...

    Do you leave the power supply and motherboard running?

    And how do you update the software of the CPU?

  • Joe (unregistered)

    no ipkvm?

  • Tom (unregistered)

    Did something sort of similar years ago. All of our app servers were Linux, all of our DBs were Solaris. I forget exactly what I was doing, but in switching between terminals I did 'killall process_name' on what I thought was a Linux server. It was not.

  • LK (unregistered)

    Guy has a bunch of servers and crashes one. How is that even a story?

  • Therac-25 (unregistered)

    Mere amateurs.

    Yep.

  • Olivier (unregistered)

    I disagree on the scripting.

    Each server had a different config, so you would have to write a script that takes a config file, test that the script works, and probably handle several places where you do different things... No time saving here, and more error-prone.

    Doing that on a Saturday was stupid: it was not critical and could have been done any weekday.

    Colocating at a remote place with no human presence is certainly stupid (if you don't have your own staff at the colo site, the staff there must provide the service; the Korean example above is certainly a bad choice).

    sudo will not help; at most it will keep a log, because if you sudo several commands one after the other, it won't request your password each time.

  • Nakke3 (unregistered)

    The real WTF is Sue's poor collaboration attitude.

  • Nakke3 (unregistered) in reply to Nakke3

    *Tori's, not Sue's

  • poniponiponi (unregistered)

    Recently converted - by the Win10 shitfest - linux (mint) user here. It confuses me how *nix guys treat going root as something that should only be attempted by experts after making offsite backups of every machine in the building and telling the fire department to be on standby, yet having users blindly paste .js or .css code from google into system files is viewed as an acceptable mechanism to do basic UI customization.

  • Steve_The_Cynic (nodebb) in reply to OzPeyter

    If your steam boiler can go boom under software miscontrol, then it is an exploding version of the Therac-25. End of argument. Where is the purely mechanical safety valve? You know, the one that opens when the pressure inside goes above X psi, and does it without any software involvement?

  • Chris (unregistered)

    I did admin stuff for a small shop (3 people including me) during my student days. Remote-admin the firewall? Yeah, no problem, just let me restart it. Oh shit, restart parameter not supported by the init script, OK, so stop the firewall... and I won an 80-minute trip from the university to the company ;-)

    Taught me the lesson to think before hitting enter on production machines.

  • Olivier (unregistered) in reply to poniponiponi

    That's the exact reason why we treat going root as something serious: no root, no risk of a bad cut/paste into a system file.

    .js or .css should not belong to root, but to some http/www user, so you don't need root access to modify them, and you should not use root level of privilege either.

    That's a philosophy that is not natural for users coming from Windows (because in Windows you have to do everything with administrator privileges, else something will not work), but it is the safe way to do it: at any given time, be assigned exactly the privileges you need to complete the task at hand, so when you make a mistake, it is not critical.

  • Vietcongster (unregistered) in reply to OzPeyter

    I'd bet it was a combination of stupidity and ass-saving for the Network Guy. After all, no one wants to be responsible for shutting down plants, and "commanding the shutdown" might be heard as "malicious safety override" by upper management.

    But your colleague should have the card "Manual Processes are Prone to Error" to counteract, if it comes to that.

  • Could be worse (unregistered)

    I did a "chmod 400 -R /"; it was supposed to be "/tftpboot/", but the damn cat jumped on the keyboard.

  • TenshiNo (unregistered) in reply to Olivier

    I firmly disagree about having to "run as admin" on Windows. That's a horrible idea, and how people get infected with viruses. Since the days of Windows 7, Microsoft made it pretty simple (barring Vista's incessant prompts) to run as a limited user, and only allow higher permissions at the moment that an app requests them.

    I am certain that there is still a major problem of people clicking "OK" without having any clue what it means.

  • Dr Dolittle (unregistered) in reply to Olivier

    On my workstation I use a regular user, but first thing I do when I ssh somewhere is sudo bash, then I stay root unless I have to ssh somewhere else. It's just absurd to connect to a server and type the same exact thing you'd do as root but start every command with sudo. It's like those "Do you really want to quit" messages you get on lousy apps when you click on the Quit menu. Imaginary safeguard that just gets in the way.

  • Yamikuronue (nodebb) in reply to z80

    "it's clear that OzPeyter's story is much more WTF material than the article."

    I agree. I work with what I'm given though -- I'd be happy to receive a submission like that to work on :) If anyone has better WTF stories than the ones you see on the site, please do submit them.

Leave a comment on “A Costly Slip”
