• Shea (unregistered)

    I don't know much about the environment, but from my background with MSCS, custom, and NLB/WLBS clusters, it sounds as if there is either:

    1. A cheap switch between the nodes... and/or...
    2. Servers which have "auto" set on their NICs rather than a fixed 100/full or 1 Gb setting.

    Either can cause the small packet loss they are experiencing. Yes, the infinite loop is the software developer's fault, but if you over-rate the cabling (if copper), the packet failure rate on a well-tuned network should be as low as 1 packet in 1 trillion (I've built such networks).
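
    To put that 1-in-1-trillion figure in perspective, here is a rough back-of-the-envelope sketch. The traffic rate of 10,000 packets per second and the "marginal link" loss rate of 1 in 1 million are assumptions made up for illustration; neither comes from the article. The function names are hypothetical as well.

```python
SECONDS_PER_DAY = 86_400


def expected_lost_packets(packets_per_second: float, loss_rate: float, seconds: float) -> float:
    """Expected number of lost packets over a time window, assuming independent losses."""
    return packets_per_second * loss_rate * seconds


def mean_seconds_between_losses(packets_per_second: float, loss_rate: float) -> float:
    """Average time between two lost packets at a steady packet rate."""
    return 1.0 / (packets_per_second * loss_rate)


if __name__ == "__main__":
    pps = 10_000  # assumed traffic between the cluster nodes, not from the article
    scenarios = [
        ("well-tuned network (1 in 1e12)", 1e-12),
        ("marginal link (1 in 1e6, assumed)", 1e-6),
    ]
    for label, rate in scenarios:
        per_day = expected_lost_packets(pps, rate, SECONDS_PER_DAY)
        gap = mean_seconds_between_losses(pps, rate)
        print(f"{label}: {per_day:.6g} lost packets/day, one loss every {gap:.6g} s")
```

    Under these assumptions the well-tuned network loses a packet roughly once every few years, while the marginal link loses one every couple of minutes, which is more than enough to trip software that cannot tolerate a single dropped packet.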

    Also avoid home-made cabling (bad NEXT/FEXT usually) and cheap switches :)

    Just a thought... it doesn't sound like there's a fix in the software portion of the system, but preventing the house of cards from falling might be a solution.

    -S

  • Anonymous (unregistered)

    Driving in, slowly, is the answer.

    Perhaps you missed Massimo's comment on the setup being a black box? Other than feeding power, he has no control over the thing?

    I'd drive in slowly, making sure the system is down for a while every time it happens. Since it's PPV, that's lost revenue, and that's the only way management will call the vendor to fix their application.

  • Broadcast Engineer (unregistered) in reply to operagost
    operagost:
    Massimo:
    Lastly, VPN: there was no one (due to "security" reasons), and although I suggested implementing it, management just didn't like the idea.
    But they liked the idea of extended down time while you drove in?
    Unfortunately, stations in the TV and radio industry do not maintain well-versed IT staffs. That usually leaves a team of engineers (who may or may not have IT knowledge) to handle issues like this. Even if they know the issue and how to fix it, they are unable to actually touch networking configurations, because the understaffed IT department is in charge of them, and allowing someone from outside the department to touch them would be a security issue (and likely a breach of a number of union contracts, depending upon where you are).
  • eric bloedow (unregistered)

    Reminds me of a story in a book called "The Hacker Crackdown": a new version of a program the phone company used had a glitch. When one server went down, it would send a message to the other servers that it was stopping for a reboot, then another message when it recovered. BUT a misplaced Return statement made the servers crash the second time they received a "recovered from a crash" message! This meant terribly slow performance for the phones all day: no dropped calls, just very slow to connect. The phone company fixed the problem and issued an apology, BUT the politicians, completely misunderstanding the explanation, decided that hackers were responsible... partly because it happened on a holiday (MLK Day).

Leave a comment on “Cluster#$%&”
