• Grumposaur (unregistered)

    Apparently downtime isn't a major issue, so leave one server unplugged.

  • Kevin (unregistered)

    Time for a variant on the STONITH algorithm. Get a couple of network-connected power strips. Find some hook point for when the backup server takes over ... and use it to "Shoot The Other Node In The Head."
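
    A rough sketch of that hook, assuming the power strip exposes an HTTP interface (the host name, outlet number, and endpoint below are made up for illustration, not any particular vendor's API):

    import urllib.request

    # Hypothetical names: substitute whatever your networked power strip actually exposes.
    PDU_HOST = "pdu.example.internal"   # network-connected power strip
    PEER_OUTLET = 3                     # outlet feeding the other node

    def shoot_the_other_node():
        """Failover hook: cut power to the peer before this node takes over,
        so a split brain can't happen."""
        # Hypothetical endpoint; real PDUs each have their own HTTP/SNMP interface.
        url = f"http://{PDU_HOST}/outlets/{PEER_OUTLET}/off"
        req = urllib.request.Request(url, method="POST")
        with urllib.request.urlopen(req, timeout=5) as resp:
            print("PDU responded:", resp.status)

    if __name__ == "__main__":
        shoot_the_other_node()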

  • Hanzito (unregistered) in reply to Kevin

    Excellent advice. Or use it to boot the fail-over server.

    That'll be $100k.

  • hwertz (unregistered) in reply to Grumposaur

    Yeah that's what I was actually going to say -- if the failover system is causing more faults than it's preventing, just keep that spare server off. I also would wonder why they'd lose connectivity -- maybe they just needed a better switch or cabling?

  • Stuart (unregistered)

    My experience with clustering says that if you think clustering is the solution to your reliability problems, you are almost certainly wrong. The myriad (and weird!) ways that I've seen clustering setups fail would boggle your mind. It got so bad at one point that I was thinking, "Just have a single server and reboot it when it fails rather than trying to have automatic flail over" (note: "flail" is not a typo.)

    Even STONITH has its problems and failure modes; sure, it helps with the split-brain scenario posited here, but if it's not properly configured, you can get all sorts of fun failures that are tricky to diagnose.

    I get why clustering is a thing. I even get that there are times when it's the best option. I just despair at the crazy number of ways that this stuff goes pear shaped without easy fixes - especially if it's not set up with due care and attention in the first instance.

  • Jeremy (unregistered)

    3 times a month means you're not learning from these incidents. Connecting remotely and shutting down one of the servers is just one of the options.
