Admin
Apparently downtime isn't a major issue, so leave one server unplugged.
Admin
Time for a variant on the STONITH algorithm. Get a couple of network-connected power strips. Find some hookpoint when the backup server takes over ... and use it to "Shoot The Other Node In The Head."
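A rough sketch of what that hookpoint could look like, in Python. Everything here is made up for illustration: `FailoverNode`, `FakePowerStrip`, and the `power_off` callable stand in for whatever interface the real power strip exposes, and the outlet number is arbitrary. The one idea it tries to show is the ordering: cut power to the other node first, and only start services once the strip confirms the outlet is off.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverNode:
    """Hypothetical backup node that fences its peer before taking over."""
    name: str
    peer_outlet: int                      # outlet feeding the OTHER server (assumed)
    power_off: Callable[[int], bool]      # assumed: returns True once the outlet is confirmed off
    log: List[str] = field(default_factory=list)
    active: bool = False

    def take_over(self) -> bool:
        # Fence first. If we cannot confirm the peer is dead, stay passive:
        # two active nodes is exactly the split-brain case we want to avoid.
        if not self.power_off(self.peer_outlet):
            self.log.append("fence failed; staying passive")
            return False
        self.log.append(f"outlet {self.peer_outlet} off")
        self.active = True
        self.log.append("services started")
        return True


class FakePowerStrip:
    """Stand-in for a network-connected power strip (not a real device API)."""

    def __init__(self, working: bool = True):
        self.working = working
        self.outlets_off = set()

    def power_off(self, outlet: int) -> bool:
        if not self.working:
            return False                  # strip unreachable: cannot confirm anything
        self.outlets_off.add(outlet)
        return True


if __name__ == "__main__":
    strip = FakePowerStrip()
    backup = FailoverNode("backup", peer_outlet=1, power_off=strip.power_off)
    backup.take_over()
    print(backup.log)
```

Note the failure branch: if the strip can't be reached, this sketch refuses to take over at all. Whether that is the right trade-off depends on whether you would rather have no server or two.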
Admin
Excellent advice. Or use it to boot the fail-over server.
That'll be $100k.
Admin
Yeah that's what I was actually going to say -- if the failover system is causing more faults than it's preventing, just keep that spare server off. I also would wonder why they'd lose connectivity -- maybe they just needed a better switch or cabling?
Admin
My experience with clustering says that if you think clustering is the solution to your reliability problems, you are almost certainly wrong. The myriad (and weird!) ways that I've seen clustering setups fail would boggle your mind. It got so bad at one point that I was thinking, "Just have a single server and reboot it when it fails rather than trying to have automatic flail over" (note: "flail" is not a typo.)
Even STOMITH has its problems and failure modes; sure, it helps with the split brain scenario posited here, but if it's not properly configured, you can get all sorts of fun failures that are tricky to diagnose.
I get why clustering is a thing. I even get that there are times when it's the best option. I just despair at the crazy number of ways that this stuff goes pear-shaped without easy fixes, especially if it's not set up with due care and attention in the first instance.
Admin
Three times a month means you're not learning from these incidents. Connecting remotely and shutting down one of the servers is just one of the options.