The Daily WTF: Curious Perversions in Information Technology

2009-01-21

Nagravision?

2009-01-21

A hundred thousand dollars??? I wonder how many seats of VMWare Infrastructure that would buy. Plenty, I would have thought.

2009-01-21

Perhaps if Massimo could properly maintain his network, the application wouldn't need to lose its connections?

2009-01-21

So, why exactly do the servers momentarily lose their connection to one another? I agree it stinks that the software completely fails if they do, but couldn't that be made moot by having a more reliable connection between the servers?

2009-01-21

Yay Irish girl is back :o)

Charles400 · 2009-01-21

This is a redundant comment.

2009-01-21

Seriously, Nagravision or IDC.

fennec · 2009-01-21

Charles400:
This is a redundant comment.

Not yet it isn't.

This is a redundant comment.

2009-01-21

Anonymous:
A hundred thousand dollars??? I wonder how many seats of VMWare Infrastructure that would buy. Plenty, I would have thought.

I was also thinking that this architecture is begging to be virtualized. That $100,000 would have gone a hell of a lot further than it did in the article.

2009-01-21

Oh, you just don't understand 'enterprise' grade hardware and software.

For you see, you take anything that's available to end-user customers, mark it up 500-1500%, hit it with a hammer a few times and, if it has cables, tie a couple of knots in them. Then you sell it to 'enterprises'.

As for software.

If there is a simple correlation between problem <-> solution, you must take at least 4 detours, get on a bus with a floppy, upload it via satellite, change byte ordering a couple of times and then you may be approaching the 'enterprise' grade solution for the problem.

2009-01-21

Alex:
Redundancy Manger

Intentional? With my luck, it was.

2009-01-21

Sure they didn't use the fancy-schmancy Windows functionality? It used to do the exact same thing, which was fun when two boxes were connected to the same storage system...

If shared hardware is used in a cluster, the boxes need a way to make sure the other box can't access the hardware or these things happen no matter who made the software.

2009-01-21

This problem cries out for a retired 386 PC with a CD-ROM drive programmed to unload the CD if the "enterprise" software fails.

Or, does that solution only work for unlocking the security doors?

operagost · 2009-01-21

OpenVMS introduced clustering in 1982 and yet most of the industry still can't figure it out. What is the point of the "Redundancy Manager" if it's not being used to determine quorum?

m0ffx · 2009-01-21

Since the fix involved unplugging network cables and restarting apps:

Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

Indeed, Massimo has an inability to administer these servers.

2009-01-21

The real WTF here is that he has to drive to the office to work on this.

2009-01-21

Ah, split-brain - the bane of any cluster setup. There are many good ways of handling it - redundant heartbeats, quorum drives, etc. And then there are bad ways - driving into the office to plug and unplug network cables.
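
To make the quorum idea concrete, here is a minimal sketch of a majority-vote check. It assumes a hypothetical layout of two nodes plus one quorum drive (three votes in total); the function names are illustrative and do not come from any real cluster product.

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """A partition may stay active only if it holds a strict majority of votes."""
    return votes_visible > total_votes // 2


def may_stay_active(sees_peer: bool, holds_quorum_drive: bool) -> bool:
    """Decide whether this node is allowed to keep running the service.

    Assumed layout: each of the two nodes has one vote and the quorum
    drive carries a third, tie-breaking vote.
    """
    votes = 1  # this node's own vote
    if sees_peer:
        votes += 1  # the peer is reachable over the heartbeat link
    if holds_quorum_drive:
        votes += 1  # this node currently owns the quorum drive
    return has_quorum(votes, total_votes=3)
```

When the heartbeat is lost, only one node can own the quorum drive, so at most one side gets two of the three votes and the other side has to stand down instead of fighting.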

2009-01-21

This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

undrline · 2009-01-21

fennec:
Charles400:
This is a redundant comment.
Not yet it isn't.
This is a redundant comment.

As the Master Comment, I feel that you should just both duke it out for commenting resources.

valerion · 2009-01-21

So there was only one server with the Redundancy Manager software on it?

Doesn't sound very redundant to me.

2009-01-21

The real solution is to just leave the network cable unplugged.

2009-01-21

m0ffx:
Since the fix involved unplugging network cables and restarting apps:
Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

Indeed, Massimo has an inability to administer these servers.

I came here to say this. With a small budget and a couple of days, this problem could easily be solved. Of course, that still doesn't address the underlying problem of why this happens so much...

JamesQMurphy · 2009-01-21

Joshua Ochs:
Ah, split-brain - the bane of any cluster setup. There are many good ways of handling it - redundant heartbeats, quorum drives, etc. And then there are bad ways - driving into the office to plug and unplug network cables.

Salmonymous:
This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

Now we have a redundant comment.

2009-01-21

m0ffx:
Since the fix involved unplugging network cables and restarting apps:
Replace hub with some sort of fancy managed switch that can electronically 'unplug' a computer.

Use some sort of monitoring software (maybe Nagios?) to detect the problem, and disconnect the computer, then run a script to do the restarting thingy.

Indeed, Massimo has an inability to administer these servers.

A virtualized infrastructure could achieve all of these things without requiring any additional hardware or software. Of course, it would also render the "redundancy manager" redundant, so all the extra hoop-jumping becomes irrelevant. I wouldn't say that I'm a massive proponent of virtualization but, as I said before, this architecture is absolutely begging for it.

Code Dependent · 2009-01-21

Right around 4:45am, Massimo pulled into the TV station's parking lot and was in the server room by 5am. Sleepily, he walked up to the network hub and unplugged the network cable on one server to stop the bleeding

Two words: VPN

jimlangrunner · 2009-01-21

fennec:
Charles400:
This is a redundant comment.
Not yet it isn't.
This is a redundant comment.

A Redundant comment on a redundant comment must be redundantly redundant.

2009-01-21

Code Dependent:
Two words: VPN

Umm, I think you'll find... oh, never mind.

jimlangrunner · 2009-01-21

Jim:
Code Dependent:
Two words: VPN
Umm, I think you'll find... oh, never mind.

Two words: Wow.

2009-01-21

Not related to the article, but "We found an Irish girl to hold the book" is full of win.

(if you don't get what I'm talking about, whitelist this site on Adblock like a good little forum monkey)

2009-01-21

The proper solution here would be to use a tried and true method of load balancing services like LVS (Linux Virtual Server). LVS can load balance any TCP/UDP service and do so with proper failover.

I have a web cluster running that has 2 machines currently serving HTTP and HTTPS requests for our web developered application (we never had enough load to scale up to more nodes). Then there are two load balancer boxes running keepalived (which hooks into LVS), constantly monitoring each other and the boxes which run the HTTP/HTTPS service.

The main difference is that keepalived works in a sane way, and in over a year of having this solution deployed, it hasn't gotten all mucked up ONCE. I generally go in and test it once every few months (unplug the network cable from one of the load balancers and make sure that the other one picks up the service IP and service isn't lost, unplug the web servers from the network and make sure they are taken out of the pool of available servers to send requests to, make sure the primary load balancer takes over again when it gets its network connection back, make sure servers are added back into the cluster automatically once they get network back).

Everything just works. The WTF here is using proprietary Windows crap for mission critical installations.
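
The behaviour checked in those tests can be sketched roughly as below. This is not keepalived's code or configuration, just a small model under assumed names (`rebuild_pool`, `pick_active_balancer`, and the host names are all hypothetical): servers that fail their health check drop out of the pool, and the highest-priority load balancer that is up holds the service IP.

```python
from typing import Dict, List, Optional


def rebuild_pool(health: Dict[str, bool]) -> List[str]:
    """Return the servers that currently pass their health check.

    A server that loses its network link fails the check and is left out;
    once the check passes again it is included automatically.
    """
    return sorted(name for name, healthy in health.items() if healthy)


def pick_active_balancer(up: Dict[str, bool], priority: List[str]) -> Optional[str]:
    """Return the highest-priority load balancer that is up, or None.

    The primary comes first in `priority`, so it takes the service IP
    back as soon as it is reachable again.
    """
    for name in priority:
        if up.get(name, False):
            return name
    return None
```

For example, `pick_active_balancer({"lb1": False, "lb2": True}, ["lb1", "lb2"])` selects the backup while the primary is unplugged, and the same call with `"lb1": True` hands the service back to the primary.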

2009-01-21

I don't remember the name of the product - it was something that mirrored the server's memory so that if one server went down you wouldn't notice. It was funny to see the two mirrored screens :-) That is... until the fiber between the servers was disturbed and the servers stopped seeing each other, and at best one of them got completely corrupted. Don't ask about the worst that could happen :-(

2009-01-21

fennec:
Charles400:
This is a redundant comment.
Not yet it isn't.
This is a redundant comment.

Not yet it isn't.

This is a redundant comment.

WhiskeyJack · 2009-01-21

Chris:
The real WTF here is that he has to drive to the office to work on this.

Yup. Just ssh into the remote box and use the "unplugNetworkCable" command, and go back to sleep...

2009-01-21

Agreed with previous commenters: fix the network. If that's not enough, write some code to shut down the network connection of Server A when Server B's traffic spikes (or when it touches certain resources indicating that it is live).

2009-01-21

TRRRWTF is that they are using hubs.

2009-01-21

[image]

2009-01-21

Ok, as usual Alex spiced up the thing a bit, so maybe some clarification is due ;-)

First of all, the whole system (which was not responsible for actual video broadcasting, but only for conditional access management, i.e. pay per view) was sold as a "black box" by its vendor to this TV station, and no one could modify, or even question, its architecture. They asked for 4 Windows 2003 servers, installed their software on them, set up various other appliances (multiplexers, encoders, crypto devices, etc.), and then left all of this to us to manage, with the agreement that if we changed anything in the setup, they weren't going to support it anymore. So, even if there were lots of ways to "do it better", we just couldn't do anything.

Second, the problem didn't actually happen so often: luckily it only happened one time, when a network cable got accidentally unplugged and we discovered how wonderfully the system was "handling" the situation.

Third, some more details on the system's architecture: it consisted of two clustered SQL Servers (yes, a plain old Windows cluster, so they actually knew how to make one...) and two "clustered" application servers, which ran the main application and were tied together by the Redundancy Manager. This plain Windows program (which among other things had to be run in a GUI session) was in charge of starting and stopping the main application services and failing them over when needed but, as we now know, it failed miserably if the heartbeat connection was lost, leading to two active nodes which got seriously angry at each other. The main WTF, though, was the absolute inability of the RM to resolve this situation even when the heartbeat connection came back: the two nodes would just keep fighting until someone stopped both manually and then started only one of them again.

Lastly, VPN: there was none (due to "security" reasons), and although I suggested implementing it, management just didn't like the idea.

And oh, yes, the system was priced really high.
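
For the two-active-nodes situation described above, one common way out is a deterministic tie-break that runs as soon as the heartbeat returns. The sketch below is only an illustration of that idea, not how the Redundancy Manager worked; the `Node` class, the `resolve_dual_active` name, and the "lowest id wins" rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int   # unique and stable, e.g. assigned at install time
    active: bool   # True while this node is running the main application


def resolve_dual_active(local: Node, peer: Node) -> None:
    """Run on each node when the heartbeat connection is restored.

    If both nodes believe they are active, the one with the higher id
    demotes itself. Both sides apply the same rule, so they reach the
    same decision without any manual intervention.
    """
    if local.active and peer.active:
        loser = local if local.node_id > peer.node_id else peer
        loser.active = False
```

Because the rule depends only on the node ids, it gives the same outcome whichever node evaluates it first, which is what removes the need to stop both nodes by hand.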

2009-01-21

Regarding the vendor: I don't want to name names, but maybe you can look for a company based in the Netherlands which sells systems for conditional access management in digital TV...

2009-01-21

Where did I see that before... mmmm.... mmmm.... Aha! I know! Every fucking place I ever worked at!

Man, there are advantages to a big corporation career, but it does get you down if you have low tolerance for stupidity.

2009-01-21

Salmonymous:
This is commonly referred to as a "split-brain" problem. Apparently, testing was not the priority for such a high price. Nothing to see here, move along.

Yes, and as another poster has said, VMS had ways to take care of this way back in the 80s, and without the stupid Linux Clustering "Shoot the other guy in the head" way of thinking.

So then you get Oracle's RAC and they think, "Oh my it's a cluster I have to determine if it is split brain and reboot machines and what not."

No Oracle, leave that stuff to the OS and the cluster sub-system, stop trying to understand the hardware and give me a better database.

2009-01-21

mmm, split brainnnnssss

Satanicpuppy · 2009-01-21

Windows Clustering is an oxymoron. A real failover cluster should never have that sort of conflict.

2009-01-21

mmm, split brainnnnssss

The key to split brains, you see, is soaking them overnight.

2009-01-21

Anonymous Coward:
[image]

That's not Art Vandelay !!!

2009-01-21

I don't understand why Massimo (or is this Alex's edits?) would argue about what the supplier will and won't do. It's management's job to go back to the supplier and tell them their setup isn't working and they need to fix it.

2009-01-21

Massimo:
Lastly, VPN: there was none (due to "security" reasons), and although I suggested implementing it, management just didn't like the idea.

That alone is reason to quit.

We should have a WTF one day that is just...

"So Eddie started working at this new company and then discovered that they didn't allow VPN so he had to drive in any time there was a problem. The End."

2009-01-21

Art Critic:
Anonymous Coward:
[image]
That's not Art Vandelay !!!

And you want to be my latex salesman....

2009-01-21

Anonymous Coward:
[image]

I admit I am not fully versed on the "Irish girl" thing; it was before my time here, so I am left to wonder if this is the original Irish girl or what? Or just some Busted Tees girl.

2009-01-21

PolyServe?

2009-01-21

[image]

Cluster#$%&
