• Grumposaur (unregistered)

    Apparently downtime isn't a major issue, so leave one server unplugged.

  • Kevin (unregistered)

    Time for a variant on the STONITH algorithm. Get a couple of network-connected power strips. Find some hook point for when the backup server takes over ... and use it to "Shoot The Other Node In The Head."
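
    A rough sketch of that hook, assuming the power strip exposes an HTTP interface (the host name, outlet number, and endpoint below are made up for illustration, not any particular vendor's API):

    import urllib.request

    # Hypothetical names: substitute whatever your networked power strip actually exposes.
    PDU_HOST = "pdu.example.internal"   # network-connected power strip
    PEER_OUTLET = 3                     # outlet feeding the other node

    def shoot_the_other_node():
        """Failover hook: cut power to the peer before this node takes over,
        so a split brain can't happen."""
        # Hypothetical endpoint; real PDUs each have their own HTTP/SNMP interface.
        url = f"http://{PDU_HOST}/outlets/{PEER_OUTLET}/off"
        req = urllib.request.Request(url, method="POST")
        with urllib.request.urlopen(req, timeout=5) as resp:
            print("PDU responded:", resp.status)

    if __name__ == "__main__":
        shoot_the_other_node()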

  • Hanzito (unregistered) in reply to Kevin

    Excellent advice. Or use it to boot the fail-over server.

    That'll be $100k.

  • hwertz (unregistered) in reply to Grumposaur

    Yeah that's what I was actually going to say -- if the failover system is causing more faults than it's preventing, just keep that spare server off. I also would wonder why they'd lose connectivity -- maybe they just needed a better switch or cabling?

  • Stuart (unregistered)

    My experience with clustering says that if you think clustering is the solution to your reliability problems, you are almost certainly wrong. The myriad (and weird!) ways that I've seen clustering setups fail would boggle your mind. It got so bad at one point that I was thinking, "Just have a single server and reboot it when it fails rather than trying to have automatic flail over" (note: "flail" is not a typo.)

    Even STONITH has its problems and failure modes; sure, it helps with the split-brain scenario posited here, but if it's not properly configured, you can get all sorts of fun failures that are tricky to diagnose.

    I get why clustering is a thing. I even get that there are times when it's the best option. I just despair at the crazy number of ways that this stuff goes pear shaped without easy fixes - especially if it's not set up with due care and attention in the first instance.

  • Jeremy (unregistered)

    3 times a month means you're not learning from these incidents. Connecting remotely and shutting down one of the servers is just one of the options.
