Admin
Actually the real wtf is that they didn't audit his code when it was causing problems with the banks internet connection.
Admin
EXACTLY!
Admin
And every production application that you have written handles all network connection failures gracefully.
Give me a break. Who are you kidding.
I try to give appropriate log messages for network failures, but the key word there is 'try'. My applications would take 100 times as long to write if I were to reliably test every possible network connection failure. And I would be out of work.
Admin
Actually the real WTF here is the number of people that can't seem to understand a fairly straightforward story. I guess maybe the OP should have dumbed it down a little.
Admin
If the program's making direct TCP/IP connections, there are a few things you can do, although there's irreducible unreliability in any network connection. And then there's reliance on NFS-mounted files, LDAP servers, proprietary database drivers, services on the local PC that use the internet ... oh, and was the application web-based?
Admin
You forgot to mention that the application was having "weird things" happen, which may result from properly handling network failures. The nature of the failure was unspecified, but the application apparently continued to function as best it could despite repeated network failures, which sounds like it handled them just fine to me.
Of course, the end user was either not informed of the failure (end users often are not allowed to know these things since it reflects poorly on the product) or ignored the error message and observed behavior that was inconsistent with what would be expected if the operation succeeded (which happens more often than end-users, especially testers, realize).
Admin
Ours submitter anonymity (he we will call "from part of Steve") felt this puncture, therefore. It would methodically develop accurately and a module that has admirably worked in atmospheres of the development and the test of the bank, but has pronounced UAT. That that is probadores had uncovered a bug in the module that they could not reproduce. The official description was that the module has rendered "disowned things" to arrive.
Admin
If the way this post was written isn't a joke, then I'd say it's a wtf in and of itself.
Admin
Oh for the love of Bob...
OK, imagine for a second a real QA person filed a good report: "Intermittently, the application shows a 'Network connection timeout after N seconds in routine XX::YY' error box. Difficult to reproduce. Also, some operations take a long time in one test, and a very short time in another; also intermittent."
We'd all say "Hm, some network problem". Not to mention the dev going in would know what to do.
From the information we've been given, it's absolutely possible that the code produced such an error. Nothing in the description disproves it. But the information also states that the dunderhead QA report was just that "weird things" happen.
Given this information, it seems that two types of people view it in two ways:
some give the benefit of the doubt to the coder and scorn the poor QA reporting, not to mention getting a chuckle out of the root cause
others instantly blame the coder for not having the proper error reporting, when it cannot be knowable from the report what error reporting there was since it's not mentioned
Why fall in the second group? Cynical? Aggressive? Projecting? Daddy never loved you? Not to mention that if you make these kinds of leaps and assumptions all the time, I hope your code is reviewed well and isn't placed in any mission-critical systems where those assumptions might fail.
For me, I'll take my chuckle and move on.
Admin
When I added exception handling and logging, I started getting questions about wtf I was doing to break the application since QA were getting error messages and stack traces in the logs when some backend modules were down. Luckily I'm not a junior engineer and have a good enough reputation to be listened to when I pointed out we need enough information when something goes wrong so we can FIX problems instead of ignore them.
Captcha: howdy
Admin
Why it is that is wtf ay? I do not include, when you come and you examine the individuals in my country, my language you speak one I keep excessively condecending the observations the himself.
Admin
Pwned by Babelfish.
Captcha = gotcha
Indeed.
Admin
What it is this "Bebelfish", way to speak over
Admin
Addendum (2007-02-15 17:00): Edit: I formally take that comment back when I reconsider that this was not an end-user report but rather a report from, presumably, QA testers.
Admin
Did you see the server room pics????? This is not a big national banking corporation; this is a small enterprise. I suspect the "QA" people are users who are conscripted to do UAT prior to a release.
Admin
That's just stupid. It's obvious that he should have used a connectionless protocol. And now he's trying to blame the poor tech for his incompetence.
Captcha: dubya - no fair. I hate that guy...
Admin
In the spirit of the Good Fairy B-Nice, I should at this point give "Steve" a big pat on the back for writing his application in such a way that it actually managed to work at all. A good 90% of the TCP-based applications I've seen fail to close sockets cleanly in unexpected circumstances -- and this sort of scenario is almost guaranteed to fall foul of FIN WAIT-2 timeouts. Kudos to Steve.
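
For what it's worth, here is a minimal Python sketch of the "close sockets cleanly" habit mentioned above. Nothing here is from Steve's module; `send_and_close` is a made-up helper name, and this only guarantees that the local side always shuts down and closes. It does not make FIN_WAIT-2 timeouts on a flaky link go away.

```python
import socket


def send_and_close(sock, payload):
    """Send payload, then shut down and close the socket even if sending fails.

    Hypothetical helper for illustration only.
    """
    try:
        sock.sendall(payload)
    finally:
        try:
            # Tell the peer we are done in both directions before closing.
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            # The peer (or the cable) may already be gone; closing is still safe.
            pass
        sock.close()
```

The point of the nested `try` is that `shutdown` itself can fail when the connection has already dropped, and that must not prevent `close` from running.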
Admin
He could also create new Exception, Business Rule exception, MoveTheFineFanFromTheNetCableException.
Admin
I agree. SOME programmers would have to be the most pedantic bunch in history, wasting time making irrelevant comments (like me, I can't believe I'm doing this). Fairly straightforward story. Obviously the fan was the cause of the problem. If there was a requirement in the specification for exception handling of network faults then he should have put that in. But quite often in large organisations, deadlines are more important than PERFECT code. I'd say the production environment, as opposed to the UAT environment, could possibly have had redundant network connections (you'd think, being a bank and all).
Admin
How do some of you propose to develop an application that's so ultra-robust as to handle every possible random network error? Honestly, in this day and age, those errors are on the same order as "out of memory".
First of all, how do you even catch the exception? Depending on whether you happen to be accessing a shared file, database, web site, web service, directory/LDAP server, mail server, FTP server, or some other network resource, there are literally hundreds of different possible exceptions/errors that could be generated by the library. Many libraries don't propagate basic network errors; they create their own wrappers around them that say something like "service unavailable". Faulty network connections, like low memory, cause very strange errors to occur, often completely different errors on each successive code execution (I saw this with one of our old workstations here which had a faulty connection, and very similar results on another machine with a defective hard drive).
Even if you caught every single error, then what? What do you do? Try again? What if it still doesn't work? Try 5 more times? 10 more times? What if this is a web site where hundreds or thousands of users may be online and the call you're making is very expensive, computationally or bandwidth-wise?
In many cases, when you're dealing with exceptions rather than error codes, it's better NOT to try to handle those kinds of errors, because they indicate a problem that's totally unrelated to the application, which was true in this case. Let the UI put up a friendly error message if it must, one which indicates that Something Very Bad happened internally, while the low-level exception gets logged and probably e-mailed to the sysadmins or a bug tracker.
Or just wrap a try { ... } catch { } around every single line of code, right!?
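
A rough Python sketch of the middle ground described above: retry a bounded number of times, log the low-level exception, then hand the UI one friendly error. All names here (`call_with_retry`, `ServiceUnavailable`, the `app.network` logger) are invented for illustration, and catching `OSError` is an assumption that only fits libraries which surface network failures that way; as noted, many wrap them in their own types.

```python
import logging
import time

logger = logging.getLogger("app.network")  # hypothetical logger name


class ServiceUnavailable(Exception):
    """Application-level error that is safe to show to the user."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); retry on OSError at most `attempts` times in total.

    ConnectionError and TimeoutError are subclasses of OSError, so the
    common socket-level failures land in the except branch below.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            # Keep the low-level detail for whoever has to diagnose it.
            logger.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt == attempts:
                # Give up: one friendly error, original exception chained.
                raise ServiceUnavailable(
                    "Something went wrong internally; the problem has been logged."
                ) from exc
            # Exponential backoff so an expensive call is not hammered.
            sleep(base_delay * 2 ** (attempt - 1))
```

The retry cap and the backoff are the answer to "try 5 more times? 10 more times?": pick a small number, wait longer each time, and then stop and report.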
Admin
"A lot of things are happening. None of the things that are happening are weird. Some are undesirable, a few are unpredicted, but absolutely none of them are supernatural or caused by witchcraft."
[resolved]
[fixed]
Have a nice day.
Admin
Even with some degree of obfuscation, sensitive production data should NEVER EVER be used on developer machines. Developers like to set up their own playgrounds on their desktop or laptop. They also don't generally have the paranoia ingrained into sysadmins.
My company's production databases only ever leave the server room on encrypted media (and in locked containers). Developers can't touch copies of production data that have not been "cleaned" by a thoroughly paranoid DBA. Accidentally divulging customer data can cost millions, so such protections are no longer optional for really any business.
People wonder why all that customer data was on the VA laptop that was stolen... it was probably a developer or report writer with a "test copy" of the data.
Admin
I think that is part of the problem. This latest code does not handle network breaks, but the older code does. The result is that even while there was a real physical problem with the network, the old code papered over it when it should have been fixed. Clearly someone has not been monitoring the network logs properly, otherwise they would have noticed that the timing of the breaks followed a regular pattern.
I wonder, with all the traffic overhead, how much of a hit they were taking. People could have been complaining about how slow the network/software was when in fact the problem was neither.
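
To illustrate the "regular pattern" point, a small Python sketch, assuming you can pull link-down timestamps (in seconds) out of whatever network log exists. The function name and the 10% jitter threshold are arbitrary choices of mine, not anything from the story.

```python
from statistics import mean, pstdev


def looks_periodic(timestamps, max_jitter=0.1):
    """Return True if the gaps between events are nearly constant.

    timestamps: event times in seconds, sorted ascending.
    max_jitter: allowed standard deviation of the gaps, as a fraction
                of the mean gap (an arbitrary illustrative threshold).
    """
    if len(timestamps) < 4:
        return False  # too few events to call anything a pattern
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average = mean(gaps)
    if average <= 0:
        return False
    return pstdev(gaps) / average <= max_jitter
```

Breaks that recur at a nearly fixed interval point at something mechanical (say, an oscillating fan) rather than at random congestion or a code bug.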
Admin
Seconded. Networks are not perfect connections, they die and the app should handle that.
Admin
It's better than a CUM Release.
Admin
Sorry, but server rooms with non-functioning AC quickly overheat, causing the servers to shut down, if the room contains any respectable number of servers.
I know this because it happened to one of my prod servers only a couple months ago. Within 2 hours we were up to 90 F. and we were shutting down 'non-essential' prod servers trying to prevent the essential servers from shutting down on their own.
Why yes, since you ask, I do work for a bank.
Admin
Let's be real, it could happen, screw the comments about backup ACs and safeguards. They're guarding your money, which is FDIC insured, FDIC doesn't require AC, and techs that do, well, should get paid less.
Banks are a business just like any other except they comply with a WHOLE lot more federal regulations and pay a WHOLE lot more to attorneys to make sure they do and well, someone's gotta pay for that.
~Signed A bank manager with a BS in Comp Sci
Admin
Hmm -- somewhat related: Years ago I worked in the Microfilm industry. We had a brand new microfiche camera that sported a "Digital Titling System"; only the second camera I am aware of to have such a feature at the time. The only trouble was the fact that the titles were blurred. But only some of them. Other images on the same film roll were nice and sharp.
After weeks of investigation by the technician, we discovered that the cameras were on the same electrical circuit as one of those huge pedestal fans that was in use in a room without AC. Just plugging in the fan caused the issue (not even running) since the motor coils acted as a huge inductor, sucking in RF noise and causing the CRT image to fuzz...
Admin
Leave this guy alone, obviously his native language is Spanish and he doesn't realize that his English is bad.
Admin
If an application cannot report a "connection lost" kind of error without anyone interpreting it as "weird errors", then it's shit coding. End of discussion.
Admin
Many many moons ago, I worked on an IT helpdesk for an industrial gas company. A girl phoned me up, claiming that "very often when she was on the phone, her screen would go all strange". Obviously I was fairly sceptical about this one, not believing that a telephone could generate enough interference to disturb a monitor ( in fact it was an HP green-screen dumb terminal ). However, she insisted that this was a real problem, so I went up to her floor to have a look. I tried using her phone, no problem at all, couldn't reproduce the issue or anything like it. Much puzzlement. In an inspired moment, I asked her to try using the phone; of course, sitting where she was, with the phone under the monitor, the handset cable rubbed on the underside of the screen, tweaking the contrast knob with every movement of her head, and the screen indeed "went all strange"! Moving the telephone 3 inches to the right seemed to solve the problem...
Admin
Can a fan replace broken air con? Maybe for people (who sweat) but not for servers - it's not going to do much but stir the hot air.
Admin
I'm reminded now of a time when the company I worked for installed a new barcode scanner system in an industrial laundry up in Leeds. The system had been well track tested and there were no special requirements for this customer. After a couple of weeks of trouble-free usage I got a lot of angry phone calls from the customer citing misread barcodes and the associated trouble that caused in the database. A site visit was deemed good for customer relations. On arriving at the site I found that the customer had moved the scan station from its original location - and tie wrapped the network cable neatly to the industrial 3-phase lecky supply. Maybe that had something to do with it.
Admin
Yeah! I do like paranoid programmers ;)
Admin
Bad luck I'd say :D That's why my stuff is in boxes/maps with names like "TOP SECRET", "BREAKABLE", "TOXIC HAZARD", "DANGER: EXPLOSIVES" etc... :P
Admin
That happened to me many, many times when I worked for a company which provided software maintenance&management for an insurance company (read: fixing bugs and tweaking their software to support their changing business model on a daily basis).
I never worked for a bank, but judging from what I read here they are very much like insurance people. Very often I got a call from our customer informing me that a weird error had happened at some agency. No, they couldn't give me an accurate description of the error, but somehow some data in the database had been corrupted. No, they couldn't tell me what the agency was exactly doing before the error happened. No, no one was able to reproduce the error in a test environment, and of course trying to reproduce it in the production environment is out of the question (no one wants to corrupt more data just to make my job easier, understandably, and I'm sure it would have been impossible to reproduce in production anyway).
So, what should I do? I chased ghosts for a couple days, flagged it as "intermittent problem, please inform me if it happens again" (almost never did) and forgot about it.
Waste of time, but I was paid to do it. Way I see it, computer people have a lot in common with hookers: we are not paid to solve problems, we are paid to give satisfaction to customers. If me chasing ghosts gives them pleasure, so be it.
Btw, 90% of the time this kind of problem could be traced to someone launching ill-thought SQL queries, BY HAND, against the production database. My guess is someone in our company IT dept thought he had l33t skillz, and he could hack away some problem or other launching SQL commands by hand. Not bothering to go through proper applications that access the database with respect for the business rules, of course. Resulting in data non-conformant to business rules, hence "data corruption". I never found out who it was. Yes, I could prove it, but the customer vehemently denied anyone was recklessly launching SQL by hand, and the customer is always right. Yes, this could be fixed with a sane permissions policy and a few triggers enforcing business rules, but I guess they thought that would stop them from using their ugly SQL hacks, so it was never done despite my advice.
End of the day, if I am paid to track down a problem with a poor description and which probably is nonexistent in the first place, then I will probably do it. Who knows, sometimes the customer is right and I actually find something to fix.
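
For anyone wondering what "a few triggers enforcing business rules" might look like, here is a toy sketch using Python's built-in `sqlite3`. The `policy` table, the "premium must be positive" rule and the `try_insert` helper are all invented for the example; the real insurance schema is unknown, and a production database would use its own trigger dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policy (id INTEGER PRIMARY KEY, premium REAL NOT NULL);

-- Hypothetical business rule: a premium must be strictly positive.
CREATE TRIGGER policy_premium_positive
BEFORE INSERT ON policy
WHEN NEW.premium <= 0
BEGIN
    SELECT RAISE(ABORT, 'premium must be positive');
END;
""")


def try_insert(premium):
    """Attempt an insert; return False if the trigger rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO policy (premium) VALUES (?)", (premium,))
        return True
    except sqlite3.DatabaseError:
        return False
```

Because the rule lives in the database, a hand-typed INSERT gets rejected just like one coming from the application. This sketch only guards INSERT; an UPDATE would need its own trigger.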
Admin
Ahem, no.
It's one thing if your app can't connect to something external, that's what queues are for... yes, by all means do your best to be robust, especially if it's a system you can't get your hands on.
However, if the client's connection to your app itself is unreliable, there's not much you can do, unless you want all your requests to come from javascript with its own timeouts and its own error pages. I think that degree of paranoia and the associated bloat would be a nice dailyWTF.
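
For the first case above (the app can't reach something external), a bare-bones Python sketch of the queue idea. `Outbox` and its `send` callable are hypothetical, the queue is in memory only, and treating `OSError` as "backend is down" is an assumption; a real system would persist the queue somewhere.

```python
from collections import deque


class Outbox:
    """Hold messages until the external system accepts them, in order."""

    def __init__(self, send):
        self._send = send        # callable that raises OSError when unreachable
        self._pending = deque()

    def submit(self, message):
        """Queue a message and try to deliver everything pending."""
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Deliver queued messages; return how many are still waiting."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                return len(self._pending)  # still down, keep the backlog
            self._pending.popleft()        # drop only after a successful send
        return 0
```

A message is removed only after `send` returns, so an outage costs latency rather than data; calling `flush()` later drains the backlog.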
Admin
Just because the UAT server is in the secure server room, why do you assume that the UAT testing also occurs there?
Much more likely that the UAT testers sit in their own room (possibly with A/C, but not really relevant) where a number of UAT client machines are connected to the UAT server via the network. The physical cable connection at the server end was being intermittently disconnected by a badly situated fan device. I assumed that the reason Steve wanted to access the server room was to determine what was different about the UAT server compared to the previous dev/test servers, where the code ran fine.
I think the real wtf is/are: (a) that the people responsible for the server room didn't realise that a network cable was being disturbed by the fan i.e. that the developer was responsible for diagnosing a hardware problem (b) that it took so much time and effort for Steve to be granted access to the server room to investigate, having already taken all reasonable steps to show the code was not the problem.
Admin
That's a clever trick. So, was the truck delivering the error message?
Admin
[[My response would be: I checked it out, weird things are not happening, problem solved. And then proceed to do nothing until a better bug description was forthcoming or I was fired.]]
Amen.
Admin
Our anonymous submitta' (we'll call him "Steve") felt dis stin', too. 'S coo', bro. He'd carefully and medodically developed some module dat wo'ked finely in de bank's development and test environments, but failed UAT. Testers had discovered some bug in de module dat dey couldn't reproduce. De official descripshun wuz dat da damn module caused "funky doodads" t'happen. 'S coo', bro.
Admin
There are quite a few companies that have a culture of shoot first, ask questions later. I had been called on the carpet several times before, when I was not to blame, at a company that literally deployed the biggest house of cards in the world. I left that place, and found a better place to work. I think that the company "Steve" worked for has the same mindset and culture. Some say it's ignorance on the part of upper management, some say they have no protection by their supervisor, I say they can all blow it out their collective... well, anyway...
Admin
Having a QA dept which can reproduce the problem, but won't even let you into the room to see them doing it is not reasonable.
Admin
Or "I know what I did but if I play dumb, someone else might fix it and no blame will stick to me"
Admin
I remember a case where a home user claimed the computer would trash when the dog barked. The skeptical tech eventually went to the house, and nothing seemed wrong till they put the cat out. After a flurry of barking, the computer reset.
Apparently the dog's 'invisible fence' was plugged into the same outlet as the computer.
Admin
No, fish (mostly).
Admin
Go back and re-read the OP. There is ZERO evidence that it was a QA dept. All he said was UAT and testers - that could very likely mean end-users tasked with testing the app before rollout.
Admin
A lot of folk assume that the server's LAN connection isn't going to "come and go". Unfortunately, it can happen.