• hmmmm (unregistered)

    Nagravision?

  • Anonymous (unregistered)

    A hundred thousand dollars??? I wonder how many seats of VMware Infrastructure that would buy. Plenty, I would have thought.

  • Matthias (unregistered)

    Perhaps if Massimo could properly maintain his network, the application wouldn't need to lose its connections?

  • Nodody (unregistered)

    So, why exactly do the servers momentarily lose their connection to one another? I agree it stinks that the software completely fails if they do, but couldn't that be made moot by having a more reliable connection between the servers?

  • Irish Lover (unregistered)

    Yay Irish girl is back :o)

  • (cs)

    This is a redundant comment.

  • Some shmoe who had to write such 'cluster' (unregistered)

    Seriously, Nagravision or IDC.

  • (cs) in reply to Charles400
    Charles400:
    This is a redundant comment.
    Not yet it isn't.

    This is a redundant comment.

  • Steve (unregistered) in reply to Anonymous
    Anonymous:
    A hundred thousand dollars??? I wonder how many seats of VMware Infrastructure that would buy. Plenty, I would have thought.
    I was also thinking that this architecture is begging to be virtualized. That $100,000 would have gone a hell of a lot further than it did in the article.
  • Some shmoe who had to write such 'cluster' (unregistered) in reply to Nodody

    Oh, you just don't understand 'enterprise' grade hardware and software.

    For you see, you take anything that's available to end-user customers, mark it up 500-1500%, hit it with a hammer a few times, and, if it has cables, tie a couple of knots in them. Then you sell it to 'enterprises'.

    As for software.

    If there is a simple correlation between problem <-> solution, you must take at least 4 detours, get on a bus with a floppy, upload it via satellite, change byte ordering a couple of times and then you may be approaching the 'enterprise' grade solution for the problem.

  • My Name (unregistered)

    Alex:
    Redundancy Manger

    Intentional? With my luck, it was.

  • Sebastian (unregistered)

    Sure they didn't use the fancy-schmancy Windows functionality? It used to do the exact same thing, which was fun when two boxes were connected to the same storage system...

    If shared hardware is used in a cluster, the boxes need a way to make sure the other box can't access the hardware or these things happen no matter who made the software.
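
    The fencing idea above can be sketched with a "fencing token" check on the storage side. This is an illustration of one common approach, not what Windows clustering or the product in the article actually does; `SharedStorage`, `write`, and the token scheme are all hypothetical names chosen for the example.

```python
class SharedStorage:
    """Hypothetical shared storage that rejects writes from a stale owner."""

    def __init__(self) -> None:
        self._highest_token = 0   # newest ownership token seen so far
        self.blocks = {}          # key -> value, standing in for disk blocks

    def write(self, token: int, key: str, value: str) -> bool:
        # A node that lost ownership still holds an older (smaller) token,
        # so its writes are refused instead of corrupting shared data.
        if token < self._highest_token:
            return False
        self._highest_token = token
        self.blocks[key] = value
        return True


storage = SharedStorage()
first_owner_ok = storage.write(token=1, key="state", value="from node A")
new_owner_ok = storage.write(token=2, key="state", value="from node B")
# Node A wakes up, still believing it owns the storage.
stale_owner_ok = storage.write(token=1, key="state", value="late write from A")
```

    The point of the design is that the check lives in the shared resource, so it still holds when the two boxes cannot talk to each other.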

  • Crash Magnet (unregistered)

    This problem cries out for a retired 386 PC with a CD-ROM drive programmed to unload the CD if the "enterprise" software fails.

    Or, does that solution only work for unlocking the security doors?

  • (cs)

    OpenVMS introduced clustering in 1982 and yet most of the industry still can't figure it out. What is the point of the "Redundancy Manager" if it's not being used to determine quorum?
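
    For readers unfamiliar with quorum, here is a minimal sketch of the general idea (a partition stays active only while it holds a strict majority of the expected votes). It is not OpenVMS's implementation; `has_quorum`, the vote counts, and the quorum-disk tie-breaker are assumptions made for the example.

```python
def has_quorum(votes_seen: int, expected_votes: int) -> bool:
    """Return True if this partition holds a strict majority of the votes."""
    return votes_seen > expected_votes // 2


# Assumed layout: two nodes plus one tie-breaker vote (for example a
# quorum disk), giving three expected votes in total.
EXPECTED_VOTES = 3

# Heartbeat lost: node A can still reach the tie-breaker, node B cannot.
node_a_active = has_quorum(votes_seen=2, expected_votes=EXPECTED_VOTES)
node_b_active = has_quorum(votes_seen=1, expected_votes=EXPECTED_VOTES)

# With only two votes and no tie-breaker, an isolated node never has a
# majority, so neither side may run alone.
two_node_isolated = has_quorum(votes_seen=1, expected_votes=2)
```

    Because two disjoint partitions cannot both hold a strict majority, at most one side keeps running after a split.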

  • (cs)

    Since the fix involved unplugging network cables and restarting apps:

    Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

    Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

    Indeed, Massimo has an inability to administer these servers.

  • Chris (unregistered)

    The real WTF here is that he has to drive to the office to work on this.

  • Joshua Ochs (unregistered)

    Ah, split-brain - the bane of any cluster setup. There are many good ways of handling it - redundant heartbeats, quorum drives, etc. And then there are bad ways - driving into the office to plug and unplug network cables.

  • Salmonymous (unregistered)

    This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

  • (cs) in reply to fennec
    fennec:
    Charles400:
    This is a redundant comment.
    Not yet it isn't.

    This is a redundant comment.

    As the Master Comment, I feel that you should just both duke it out for commenting resources.

  • (cs)

    So there was only one server with the Redundancy Manager software on it?

    Doesn't sound very redundant to me.

  • DrFloyd5 (unregistered)

    The real solution is to just leave the network cable unplugged.

  • Downfall (unregistered) in reply to m0ffx
    m0ffx:
    Since the fix involved unplugging network cables and restarting apps:

    Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

    Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

    Indeed, Massimo has an inability to administer these servers.

    I came here to say this. With a small budget and a couple of days, this problem could easily be solved. Of course, that still doesn't address the underlying problem of why this happens so much...

  • (cs) in reply to Salmonymous
    Joshua Ochs:
    Ah, split-brain - the bane of any cluster setup. There are many good ways of handling it - redundant heartbeats, quorum drives, etc. And then there are bad ways - driving into the office to plug and unplug network cables.
    Salmonymous:
    This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

    Now we have a redundant comment.

  • Steve (unregistered) in reply to m0ffx
    m0ffx:
    Since the fix involved unplugging network cables and restarting apps:

    Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

    Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

    Indeed, Massimo has an inability to administer these servers.

    A virtualized infrastructure could achieve all of these things without requiring any additional hardware or software. Of course, it would also render the "redundancy manager" redundant, so all the extra hoop-jumping becomes irrelevant. I wouldn't say that I'm a massive proponent of virtualization but, as I said before, this architecture is absolutely begging for it.

  • (cs)
    Right around 4:45am, Massimo pulled into the TV station's parking lot and was in the server room by 5am. Sleepily, he walked up to the network hub and unplugged the network cable on one server to stop the bleeding
    Two words: VPN
  • (cs) in reply to fennec
    fennec:
    Charles400:
    This is a redundant comment.
    Not yet it isn't.

    This is a redundant comment.

    A Redundant comment on a redundant comment must be redundantly redundant.

  • Jim (unregistered) in reply to Code Dependent
    Code Dependent:
    Two words: VPN
    Umm, I think you'll find... oh, never mind.
  • (cs) in reply to Jim
    Jim:
    Code Dependent:
    Two words: VPN
    Umm, I think you'll find... oh, never mind.

    Two words: Wow.

  • James (unregistered)

    Not related to the article, but "We found an Irish girl to hold the book" is full of win.

    (if you don't get what I'm talking about, whitelist this site on Adblock like a good little forum monkey)

  • Chris (unregistered)

    The proper solution here would be to use a tried and true method of load balancing services like LVS (Linux Virtual Server). LVS can load balance any TCP/UDP service and do so with proper failover.

    I have a web cluster running that has 2 machines currently serving HTTP and HTTPS requests for our web application (we never had enough load to scale up to more nodes). Then there are two load balancer boxes running keepalived (which hooks into LVS), constantly monitoring each other and the boxes which run the HTTP/HTTPS service.

    The main difference is that keepalived works in a sane way and, in over a year of having this solution deployed, it hasn't gotten all mucked up ONCE. I generally go in and test it once every few months (unplug the network cable from one of the load balancers and make sure that the other one picks up the service IP and service isn't lost, unplug the web servers from the network and make sure they are taken out of the pool of available servers to send requests to, make sure the primary load balancer takes over again when it gets its network connection back, make sure servers are added back into the cluster automatically once they get network back).
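
    The pool behaviour being tested above (failed servers drop out, recovered servers rejoin automatically) can be sketched roughly as follows. This is not keepalived's code or configuration; `ServerPool`, the server names, and the simulated `reachable` table are hypothetical stand-ins for a real TCP or HTTP probe.

```python
from typing import Callable, Dict, List


class ServerPool:
    """Tracks which real servers are currently eligible to receive requests."""

    def __init__(self, servers: List[str], check: Callable[[str], bool]) -> None:
        self._servers = list(servers)
        self._check = check
        self._healthy: Dict[str, bool] = {s: True for s in servers}

    def run_checks(self) -> None:
        """Probe every server; failed ones leave the pool, recovered ones rejoin."""
        for server in self._servers:
            self._healthy[server] = self._check(server)

    def available(self) -> List[str]:
        """Return the servers that passed their most recent check."""
        return [s for s in self._servers if self._healthy[s]]


# Simulated network state stands in for a real health probe.
reachable = {"web1": True, "web2": True}
pool = ServerPool(["web1", "web2"], check=lambda s: reachable[s])

reachable["web2"] = False   # "unplug" web2
pool.run_checks()
after_unplug = pool.available()

reachable["web2"] = True    # plug it back in
pool.run_checks()
after_replug = pool.available()
```

    No manual step is needed in either direction, which is the property the unplug tests are meant to confirm.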

    Everything just works. The WTF here is using proprietary Windows crap for mission critical installations.

  • Carlos92 (unregistered)

    I don't remember the name of the product - it was something that mirrored the server's memory so that if one server went down you wouldn't notice. It was funny to see the two mirrored screens :-) That is... until the fiber between the servers was disturbed and the servers stopped seeing each other, and at best one of them got completely corrupted. Don't ask about the worst that could happen :-(

  • Anon (unregistered) in reply to fennec
    fennec:
    Charles400:
    This is a redundant comment.
    Not yet it isn't.

    This is a redundant comment.

    Not yet it isn't.

    This is a redundant comment.

  • (cs) in reply to Chris
    Chris:
    The real WTF here is that he has to drive to the office to work on this.

    Yup. Just ssh into the remote box and use the "unplugNetworkCable" command, and go back to sleep...

  • Procedural (unregistered)

    Agreed with previous commentators: fix the network. If that's not enough, write some code to shut down the network connection of Server A when Server B's traffic spikes (or touches certain resources indicating that it is live)

  • bored (unregistered)

    TRRRWTF is that they are using hubs.

  • Anonymous Coward (unregistered)
  • Massimo (unregistered)

    Ok, as usual Alex spiced up the thing a bit, so maybe some clarification is due ;-)

    First of all, the whole system (which was not responsible for actual video broadcasting, but only for conditional access management, i.e. pay-per-view) was sold as a "black box" by its vendor to this TV station, and no one could modify, or even question, its architecture. They asked for 4 Windows 2003 servers, installed their software on them, set up various other appliances (multiplexers, encoders, crypto devices, etc.), and then left all of this to us to manage, with the agreement that if we changed anything in the setup, they weren't going to support it anymore. So, even if there were lots of ways to "do it better", we just couldn't do anything.

    Second, the problem didn't actually happen so often: luckily it only happened one time, when a network cable got accidentally unplugged and we discovered how wonderfully the system was "handling" the situation.

    Third, some more details on the system's architecture: it consisted of two clustered SQL Servers (yes, a plain old Windows cluster, so they actually knew how to make one...) and two "clustered" application servers, which ran the main application and were tied together by the Redundancy Manager. This plain Windows program (which, among other things, had to be run in a GUI session) was in charge of starting and stopping the main application services and failing them over when needed. But, as we now know, it failed miserably if the heartbeat connection was lost, leading to two active nodes which got seriously angry at each other. The main WTF, though, was the absolute inability of the RM to resolve this situation even after the heartbeat connection came back: the two nodes would just keep fighting until someone stopped both manually and then started only one of them again.
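
    To make the last point concrete, here is a minimal sketch of the kind of rule that would let two nodes settle a dual-active state by themselves once the heartbeat returns. It is purely illustrative and says nothing about how the vendor's Redundancy Manager works internally; `should_stay_active` and the "lowest node ID wins" rule are assumptions.

```python
def should_stay_active(local_id: int, local_active: bool,
                       peer_id: int, peer_active: bool) -> bool:
    """Decide, on heartbeat recovery, whether the local node remains active.

    Both nodes evaluate the same deterministic rule with the roles swapped,
    so exactly one of them keeps running when both were active.
    """
    if not local_active:
        return False          # a standby node stays on standby
    if not peer_active:
        return True           # no conflict: the only active node carries on
    return local_id < peer_id  # conflict: the lower ID wins, the other yields


# Heartbeat comes back and both nodes report themselves as active.
node1_stays = should_stay_active(local_id=1, local_active=True,
                                 peer_id=2, peer_active=True)
node2_stays = should_stay_active(local_id=2, local_active=True,
                                 peer_id=1, peer_active=True)
```

    A rule like this only cleans up after the split; it does nothing to protect shared state while the heartbeat is down, which is what quorum and fencing are for.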

    Lastly, VPN: there wasn't one (due to "security" reasons), and although I suggested implementing it, management just didn't like the idea.

    And oh, yes, the system was priced really high.

  • Massimo (unregistered)

    Regarding the vendor: I don't want to name names, but maybe you can look for a company based in the Netherlands which sells conditional access management systems for digital TV...

  • Daniel (unregistered)

    Where did I see that before... mmmm.... mmmm.... Aha! I know! Every fucking place I ever worked at!

    Man, there are advantages to a big corporation career, but it does get you down if you have low tolerance for stupidity.

  • PG (unregistered) in reply to Salmonymous
    Salmonymous:
    This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

    Yes, and as another poster has said, VMS had ways to take care of this way back in the 80s, and without the stupid Linux clustering "shoot the other guy in the head" way of thinking.

    So then you get Oracle's RAC and they think, "Oh my it's a cluster I have to determine if it is split brain and reboot machines and what not."

    No Oracle, leave that stuff to the OS and the cluster sub-system, stop trying to understand the hardware and give me a better database.

  • mentor (unregistered)

    mmm, split brainnnnssss

  • (cs)

    Windows Clustering is an oxymoron. A real failover cluster should never have that sort of conflict.

  • The Badget (unregistered) in reply to mentor
    mmm, split brainnnnssss

    The key to split brains, you see, is soaking them overnight.

  • Art Critic (unregistered) in reply to Anonymous Coward
    Anonymous Coward:
    That's not Art Vandelay !!!
  • Morry (unregistered)

    I don't understand why Massimo (or is this Alex's edits?) would argue about what the supplier will and won't do. It's management's job to go back to the supplier and tell them their setup isn't working and they need to fix it.

  • anon (unregistered) in reply to Massimo
    Massimo:
    Lastly, VPN: there wasn't one (due to "security" reasons), and although I suggested implementing it, management just didn't like the idea.

    That alone is reason to quit.

    We should have a WTF one day that is just...

    "So Eddie started working at this new company and then discovered that they didn't allow VPN so he had to drive in any time there was a problem. The End."

  • anon (unregistered) in reply to Art Critic
    Art Critic:
    Anonymous Coward:
    That's not Art Vandelay !!!

    And you want to be my latex salesman....

  • clickey McClicker (unregistered) in reply to Anonymous Coward
    Anonymous Coward:

    I admit I am not fully versed on the "Irish girl" thing, it was before my time here, so I am left to wonder if this is the original Irish girl or what? Or just some busted tees girl.

  • Loraxxarol (unregistered)

    PolyServe?

  • Coward #2 (unregistered)

Leave a comment on “Cluster#$%&”
