Admin
I've got a good one. Was writing code for a web banking application. One Friday afternoon, we put out a new version of our app that introduced a few minor features, nothing big. We ran the update to the database and posted the code. Logged in, did a basic check to make sure it didn't blow up and went home for the weekend.
Came back on Monday morning to a shitstorm. Among the things on that particular build, I updated the stored procedure that got a list of the customer's pending transactions. Somehow, between the test environment, where we had done extensive testing, and production, I managed to delete the "go" between the command to create the sp and the command to grant rights on it. So now, every time a customer tried to view a list of their pending transactions, the page would error and the customer would get a generic error message telling them to try again in a few minutes.
What's that? What's the big deal about not being able to see your pending transactions? Good question. The problem was that after a customer initiated an online transfer, the next step after they hit the "confirm" button was to take them to their list of pending transactions, where they could see the list with the new transaction at the top. So you would get create-confirm-error, try again. Create-confirm-error, try again. Except the transactions were getting created. Over the course of the weekend, several thousand duplicate transfers, totalling millions of dollars, were submitted. The only thing that saved my job was that they weren't real-time transactions and we were able to catch the problem before the batch job to actually send them out went through. Was a fun morning of poring through transactions and deleting duplicates, while getting hounded with "Can we run the batch yet?!?!"
Admin
Delete without a WHERE clause ... whenever I type "Delete From Table" I always, immediately, type "Where" on THE SAME LINE. Even if I stop to think about what the Where clause should say, I make sure it's entered right away, on the same line. That minimizes the possibility that a slip of the hand or some forgetfulness will do something to execute the whole line. (I have, a couple of times, executed the line "Delete From Table Where", but thankfully, that's a syntax error instead of a "Whoops" moment.)
Usually, I type "Select * From Table Where ..." and then, when it looks right, I change "Select *" to "Delete".
Of course, these aren't new techniques, but they are useful.
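The "SELECT first, then DELETE" habit described above can also be scripted. What follows is only a minimal sketch using Python's built-in sqlite3 module; the helper name `delete_with_preview`, the `orders` table, and the `expected_rows` parameter are made up for illustration and are not from the post.

```python
import sqlite3

def delete_with_preview(conn, table, where_sql, params, expected_rows):
    """Count the rows the WHERE clause matches; delete only if the count is as expected."""
    # NOTE: table and where_sql are interpolated directly, so they must come
    # from the operator, never from untrusted input.
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_sql}", params
    ).fetchone()[0]
    if count != expected_rows:
        raise ValueError(f"WHERE matches {count} rows, expected {expected_rows}")
    # Same WHERE clause, with only the verb changed from SELECT to DELETE.
    conn.execute(f"DELETE FROM {table} WHERE {where_sql}", params)
    conn.commit()
    return count

# Hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("open",), ("cancelled",), ("open",)])
conn.commit()

deleted = delete_with_preview(conn, "orders", "status = ?", ("cancelled",), expected_rows=1)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the WHERE clause is a required argument, there is no way to call the helper with a bare "Delete From Table".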
Admin
I can say that disabling a watchdog timer while debugging an embedded system does, in fact, make debugging a whole heckuvalot easier.
But I don't comment it out... even in my hobbyist code. That's what #ifdef DEBUG is for.
Admin
3 hours before I'm due on a plane out of the country for two weeks and our SQL slave server has a hiccup with the logs table which has about 10 million rows in it.
Desperate to get out of the office and down to the airport I pull up a terminal window to the master to verify the data is ok and then one to the slave and verify the corruption.
Planning to simply start the slave from scratch I issued DROP DATABASE entireSystemDB* and hit enter ... just as I realised I was on the master and not the Slave!!!
It doesn't matter how fast you hit Ctrl-C, that database is still gone**...
*Yes names have been changed to protect the guilty.
** Yes we had a backup but the restore still took forever - in relative terms.
Admin
Not Whoops. Lawsuit !
Admin
Well...did you catch the plane?
Admin
I worked for a school district in the Upper Midwest about 15 years ago. At that time, our backup strategy was...minimal at best. Every so often someone would say "Hey, we haven't backed up the server at School X in a while" and that would trigger a round of backups for each of the 60+ schools in the district.
At that time, backing up the servers required carrying out a gas-plasma luggable PC, taking the school's entire network down, putting a new backup tape into the luggable, running the backup software and waiting for the backup to complete.
One day the call came in: "we've got a crashed server here, and nothing I do will get it running. We'll need to get a new machine out here and restore from the last backup."
Fortunately, we had just completed a round of backups, so there wouldn't be much data loss. We grabbed a replacement box and the backup machine and restored from the last tape. We brought up the server to test it. Or at least we tried. Massive failure.
We eventually got around to using the backup utility to examine the tape: every file on the tape was corrupt.
Uh-oh.
We grabbed the previous tape for that school. Same thing.
We grabbed a tape for another school. Same.
We started grabbing tapes out of the vault. Years of backups for dozens of servers. We popped tapes at random into the backup box, and we came to a horrifying conclusion: we had NO good backups for ANYTHING. Not once in the five years I had been there had any of us ever tested out the restore process.
Oops.
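For what it's worth, the missing step (actually restoring a backup and checking it) can be automated. This is a rough sketch only, assuming plain files and a tar archive rather than the tape setup described above; the function names and the `grades.txt` sample file are invented for the example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum of one file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_and_verify(source_dir, archive_path):
    """Write a backup, restore it to a scratch directory, and compare every file."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    with tempfile.TemporaryDirectory() as scratch:
        # The restore is the part that never got tested in the story.
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        for original in source_dir.rglob("*"):
            if original.is_file():
                restored = Path(scratch) / original.relative_to(source_dir)
                if not restored.is_file() or sha256_of(original) != sha256_of(restored):
                    return False
    return True

# Hypothetical demo: back up one small directory and verify the restore.
with tempfile.TemporaryDirectory() as work:
    data = Path(work) / "data"
    data.mkdir()
    (data / "grades.txt").write_text("alice,A\nbob,B\n")
    verified = backup_and_verify(data, Path(work) / "backup.tar.gz")
```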
Admin
I was installing point-of-sale terminals in Madison Square Garden. Our system ran all the cash registers in all the concession stands, and a hockey game was about to start in a few hours.
I needed to delete a machine from the DNS zone. Being a DNS newbie, I accidentally deleted the entire zone. Suddenly, none of the POS terminals could communicate with the rest of the system, meaning they couldn't get menu items, pricing, etc.
While two hundred concession stand cashiers sold concessions and made change from their bank bags, I was madly entering A records into DNS. Our system was down for the entire hockey game.
Needless to say, MSG wasn't too pleased with us. And I learned a little about DNS that day.
The one bright point of the episode was that the onsite project manager, at the end of the day, commented "just think of it this way... for the rest of your life, no matter what happens in your career, you'll always be able to say to yourself, 'at least it's not as bad as that one day at Madison Square Garden.'"
So far, he's been right.
Admin
Eventually we would be able to implement backups across the network to large storage machines...but this was back in the day when ISDN was considered fast.
Admin
That's why MySQL has "--i-am-a-dummy" mode, which prevents one from running UPDATE or DELETE statements without a WHERE clause.
Of course, you could run it with WHERE 1=1, but the point is to avoid accidents, not sheer stupidity.
Admin
That is why in a test or, in this case, slave DB you always prepend the word "test" or "slave" to your names!
Admin
A few years ago our production database went down and the backup turned out to be no good. The IT director was fired shortly thereafter, and we signed a contract with a vendor to handle remote backups for us.
A few months later, our production database went south again. We contacted our vendor and asked for a backup. They replied with two important pieces of information:
Admin
If you were using MS DNS on Windows Server, you could have just reloaded the zone from the file in %systemroot%\system32\dns, deleting a zone does not delete the file. Unless of course it was the Active Directory-Integrated zone, in which case you're screwed, but that's what you get for only having one AD DNS server! If you had more than one AD DNS server you could just replicate the zone from the other server.
Admin
Yes, I could have. But as I said, I was a DNS newbie and didn't know that. (At least not until it was too late to matter.)
Admin
And they didn't sue you for all that ? (second story)
Admin
Just take the D key off of your keyboard... problem solved.
Admin
Haha! You're assuming these PLC systems have things like "ifdef" and "debug modes". I write software that runs on a brand of motion controller that JUST released a huge update to their $5000+ main product: the ability to pass parameters into subroutines! WooOOoOOOOoo!
Not to mention it stores its program as raw ASCII and interprets it on the fly. It essentially has an array of 80 by X lines of code that it jams your downloaded program in to, so if you write:
x=1
y=2
z=3
That uses 3 lines, and you'd only be able to put this into one of the older model controllers about 300 times whereas x=1;y=2;z=3 only uses one line and you could put it in the same controller a thousand times.
If your program uses more lines than your model controller can accept, it "compresses" it by replacing all the end of line characters with semicolons to try to cram short lines onto these 80-character lines. Editing code that has been compressed without saving an uncompressed copy gets "fun" because all your spacing (indents, etc) is entirely gone.
This all means that the more complex your program, the less likely you are to write in a bunch of (IF DEBUG=1) sections into it because you'll just run out of space that much faster.
Note I'm not defending this guy for forgetting to enable the watchdog, but commenting it out may have been his only option.
Admin
Wrapping it in a transaction and handling it properly would be the right way to do it. Laziness HAS to be 99% of the cause of most problems.
Admin
Time for a new controller then. The only controllers like that these days are ones where people don't want to spend money. All the major ones have had functional stuff for years, and many new controllers these days run Windows CE.
I do hear you about some of the older controllers, I was in that phase at one point in my life. No longer, and I'm glad for it.
Admin
void procToBeCommentedOut(...)
{
#ifdef DEBUG
    return;
#endif
    rest of proc here
}
Admin
You're assuming that industrial controllers are programmed via a source->compiler->executable system. ProTip .. They're not.
Admin
What does the atomicity of a transaction have to do with it? He runs the SELECT first, so he can ensure that it includes the right records. Then he changes "select *" to "delete".
Are you talking about wrapping a single delete in a transaction? That makes no sense.
Admin
But nothing stops you from using a preprocessor, even if the language in question doesn't have a native one. Simply treat the unpreprocessed version as source code and the preprocessed version as object code.
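A bolt-on preprocessor along those lines doesn't have to be much. Here's a minimal sketch, assuming the controller program is available as plain text and borrowing the C-style `#ifdef NAME` / `#endif` markers; the `preprocess` function and the sample program are hypothetical, and `print x` stands in for whatever debug statement the controller language actually has.

```python
def preprocess(source, defined=frozenset()):
    """Keep lines inside '#ifdef NAME' ... '#endif' only when NAME is in `defined`."""
    output = []
    stack = []  # one bool per open #ifdef: is that block active?
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef "):
            stack.append(stripped.split()[1] in defined)
        elif stripped == "#endif":
            if not stack:
                raise ValueError("#endif without matching #ifdef")
            stack.pop()
        elif all(stack):  # emit only when every enclosing block is active
            output.append(line)
    if stack:
        raise ValueError("unterminated #ifdef")
    return "\n".join(output)

# The unpreprocessed text is the "source code"...
program = """x=1
#ifdef DEBUG
print x
#endif
y=2"""

# ...and the preprocessed text is the "object code" that gets downloaded.
release_build = preprocess(program)
debug_build = preprocess(program, {"DEBUG"})
```

The release build carries no debug lines at all, so it doesn't eat into the controller's line budget.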
Admin
My biggest goof to date was being instructed to run an upgrade install of Windows 98 on a computer. Instead, I happened to blank the hard drive and do a fresh install. oops.
Admin
Use Oracle. The last few versions have flashback. You can do an undelete on rows. Incredibly useful for those boneheaded moments.
Admin
you make it an explicit transaction, so you have time to decide whether you want to commit or roll back, when you see 3,000,000 rows affected, rather than 1 row affected, as you thought you should.
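To illustrate the "look at the rows affected before you commit" idea, here is a minimal sketch with Python's built-in sqlite3 module. The `update_expecting` helper and the `accounts` table are invented for the example; in an interactive SQL session you would do the same thing by hand with BEGIN and then COMMIT or ROLLBACK.

```python
import sqlite3

def update_expecting(conn, sql, params, expected_rows):
    """Run one statement inside a transaction; roll back unless rowcount matches."""
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly for DML
    if cur.rowcount != expected_rows:
        conn.rollback()              # "3,000,000 rows affected" moment: undo it
        return False
    conn.commit()
    return True

# Hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO accounts (active) VALUES (?)", [(1,), (1,), (1,)])
conn.commit()

# Forgot the WHERE clause: 3 rows touched when 1 was expected, so it is rolled back.
ok_bad = update_expecting(conn, "UPDATE accounts SET active = 0", (), expected_rows=1)
# With the WHERE clause, exactly one row changes and the change is committed.
ok_good = update_expecting(conn, "UPDATE accounts SET active = 0 WHERE id = ?", (2,),
                           expected_rows=1)
active = conn.execute("SELECT COUNT(*) FROM accounts WHERE active = 1").fetchone()[0]
```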
Admin
Now you are assuming that you can get code into and out of these systems in an easily maintainable manner, and that they all have text format versions of their code. ProTip (again) They don't.
Admin
This didn't happen to me personally, but I was told this by the guy who it happened to. He didn't give us the company names and other details to protect the guilty (himself included :-)
His code was used in a factory production environment to control, amongst other things, the temperature of vats doing some tricky chemical reactions. The relevant part of the code was a basic negative feedback loop. Test the temperature, apply correction, lather, rinse, repeat.
Naturally, in an environment like this, you never release code to production without testing it up the wazoo. To help the testing they had a print statement inside that loop, so they could see what was going on in real-time.
Well, they tested it and all was well. Then they took out the print statement, recompiled, and deployed it to production. No tedious re-testing - there is no possible way this would affect the behavior of the program, right?
Well... it turned out there was a bug in the compiler. Now, this was the 80s, and these things happened more often than today's spoiled brats expect. I have personally seen C++ and Fortran compilers failing - as late as '94. To this day I don't 100% trust these tricksy beasts.
At any rate, the bug reversed the interpretation of the if statement, converting the code to a positive feedback loop. Which, of course, is a very negative thing to have.
The resulting explosion took off the top of the vat, the roof, and everything in between. Luckily this did not include any people's heads - except in the metaphorical sense :-)
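To see why flipping that one if statement matters so much, here is a toy simulation. It is only a sketch of a generic proportional control loop; the setpoint, gain, and the `simulate` function are made-up illustration values, not anything from the actual plant code.

```python
def simulate(setpoint, start_temp, gain, steps, inverted=False):
    """Repeatedly measure the temperature and apply a proportional correction."""
    temp = start_temp
    for _ in range(steps):
        error = setpoint - temp        # positive when the vat is too cold
        correction = gain * error
        if inverted:                   # the compiler-bug case: the sign is flipped
            correction = -correction
        temp += correction
    return temp

# Negative feedback: each step halves the distance to the setpoint.
stable = simulate(setpoint=100.0, start_temp=80.0, gain=0.5, steps=20)
# Positive feedback: each step pushes the temperature further from the setpoint.
runaway = simulate(setpoint=100.0, start_temp=80.0, gain=0.5, steps=20, inverted=True)
```

With the correct sign the error shrinks every iteration; with the sign flipped it grows by half again every iteration.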
Admin
[quote user="campkev"]I've got a good one. Was writing code for a web banking application. One Friday afternoon, we put out a new version of our app that introduced a few minor features, nothing big. We ran the update to the database and posted the code. Logged in, did a basic check to make sure it didn't blow up and went home for the weekend. [/quote]
What? Deploy on a Friday and go home? Are you an f'ing moron? I wouldn't expect that for a trivial system which would hardly be used over the weekend, but a banking app? Actually... I don't believe you.
Admin
Wow, lots of horror stories here. The worst I have is one time forgetting to put the WHERE clause in my UPDATE statement. Whew... you only do that once before you wrap EVERYTHING in a transaction, let me tell you.
Luckily in my case, this was a database that didn't get much use yet and thus turned out to be relatively unimportant.
Admin
Did you try a different tape drive? Just wondering?
Admin
I've got two, both telco related.
** I now work at a large west coast university that at one time had a single Gigaman connection to the NOC. On this day, the Gigaman went down, and it took telco 24 hours to restore the circuit. Their technician was on site for the entire time.
We now have a redundant Gigaman.
Admin
rm -rf /etc /some/folder/I/want/to/trash...
Notice the space after etc.
rm -rf is dangerous.
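If it helps to see what the shell makes of that stray space, Python's shlex module splits a command line using POSIX-shell-like quoting rules. This is just a sketch; the path is a shortened stand-in for the one above, and nothing is deleted.

```python
import shlex

# Split the command line the way a POSIX shell would (no execution involved).
tokens = shlex.split("rm -rf /etc /some/folder/I/want/to/trash")

# Everything after the command name that is not an option is a separate target.
targets = [t for t in tokens[1:] if not t.startswith("-")]
```

The space turns one path into two arguments, so `/etc` becomes a target in its own right.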
Admin
You'd be much better typing this out before any updates/deletes:
BEGIN TRANSACTION
Admin
Oooh, that rings a bell.
I had a bug to fix in the admin screen of a (non-web, non-consumer) banking app for stock transactions. Problem was, even though we were working in a test environment, us lowly developers weren't trusted with admin accounts. So the only way to see the admin screen and confirm that the bug was indeed fixed was to replace the isAdmin() function call with "true".
As you've probably already guessed, I forgot to reverse that when checking in the change, it was not caught by the testers (in itself a WTF) and made it into production... suddenly everyone was an admin!
Fortunately, being admin actually didn't mean all that much in this system. The worst thing you could do with it would have been to remove overdraft checking and allow people to buy more stocks than they have money for.
Admin
While working as director of audio programming at a large video game publisher we finished what was predicted to be a good selling title.
It went to the foundry for duplication and I did not get an approval of the proof prior to the run. The producer got the proof but failed to check it.
The run was made and shipped to stores and support lit up like a Christmas tree.
I found a line of top management and the production team at my door that day when I returned from lunch with their knives sharpened as this was a very expensive mistake.
Each of the redbook audio tracks was missing the first few seconds, making for a jarring audio presentation.
"It must be your code," said one; "it's your data," said another; others just stared daggers.
I unsealed the game box they brought and we tested a virgin disk. All bad.
As this MSCDEX code was in many other games and they all passed test I discounted that theory. I then ripped a few tracks and looked at them in Sound Designer. Definitely truncated.
I was beginning to sweat when I requested the foundry proof. Our workflow had the proof being signed off by the Producer and AP.
They went and fetched the foundry proof and I ripped the same tracks and...
TRUNCATED!
It was amazing how fast the line turned around to stare bullets at the Producer. After that I got a sign off on all foundry proofs.
I also went back out to lunch for the rest of the day as that was all the excitement I wanted to take for one day.
The lesson is: Trust But Verify (or Veri-FRY)
Admin
But that's not what happened when one of them went awry on October 7 and began sending erroneous data spikes on the plane's angle of attack (AOA) - the angle between its wings and the air flowing over them - to the flight control computer. "For some reason the damn computer disregarded the healthy channels," says Hans Weber, an aviation expert who heads Tecop International, an aviation consulting firm in San Diego. "Instead, it acted upon the information from the rogue channel." The computer, responding to the faulty data, put the plane into a dive. (Read "Is There a Cause for Fear of Flying?")
http://news.yahoo.com/s/time/20090602/wl_time/08599190242100
Admin
Yeah, glad someone else thought that. I thought maybe Snoofle knew something I didn't.
Admin
I myself am learning these things. In the meantime, "SELECT" lets me narrow my queries to the correct scope before changing to DELETE, which is just a darned good idea.
Admin
And on MS Windows, <SHIFT>- is also very dangerous. I find myself doing this a lot. Only a few times has it bitten me badly.
Admin
A Unix guy, who should have known better, typed "rm -fr /" and then turned to his idiot flunky buddy and said something like, "I know I'm not supposed to do this in root, but I always double check, so it's okay," turned back around, and typed " usr/local/app_version_1_backup/".
So what was on the screen was the following:
production_1>rm -rf / usr/local/app_version_1_backup/
Then he casually hit enter and killed the production machine.
But it doesn't stop there! After the chaos had been rectified (by me and the competent guy working 50 hours straight), we decided to upgrade the whole system. We were going to buy two new computers, and set them up to mirror the code nightly, and do round-robin load balancing during the day.
So I take the running machine and set it up next to one of the new machines, and rsync them, and it works during the test, so I did it live, including the --delete flag, so the drives wouldn't be cluttered with mp3s from the first test. I tell everyone to leave it alone. I go home and start working my way through a 3 day stress vacation.
In the meantime, the PHB tells the two junior idiots who screwed it up in the first place to get the second new machine swapped in because he doesn't "trust" the old machine. So they set up the rsync on the new box using the original sync command I'd used (mistake 1). They set it up to sync every night (mistake 2) at 12 (mistake 3), and let it run. Come in the next day, check it, it looks fine (mistake 4)!
Then they take out the old machine and put the second new, blank, machine in. Then they indulged in a little Office Space/Reservoir Dogs style abuse on the old machine (mistake 5), which they blamed for their recent woes (which basically consisted of them being screamed at for being morons while standing around (during the 9-5 period when they worked) watching me fix the machine). Then they went to lunch.
Turns out that cron works on 24 hour time, and so when you put in your sync for 00 12 * * *, it goes off at noon.
So the machine kicks off its sync at about the time they walked out the door, and since they had thoughtfully included the "--delete" flag, the machine looked for the machine it was supposed to sync from, noted that everything had been deleted since its last sync, and promptly wiped itself out. Live, production machine.
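For anyone else who has to schedule a "midnight" job: the first two crontab fields are minute and hour, and the hour runs from 0 to 23. Here's a tiny sketch that reads them back; the `describe_cron_time` helper is made up for illustration and only handles plain numeric minute and hour fields, not ranges, lists, or `*`.

```python
def describe_cron_time(entry):
    """Return the 24-hour 'HH:MM' at which a simple daily crontab entry fires."""
    minute, hour = entry.split()[:2]  # crontab field order: minute, then hour
    return f"{int(hour):02d}:{int(minute):02d}"

noon = describe_cron_time("00 12 * * *")     # the entry from the story
midnight = describe_cron_time("0 0 * * *")   # what a nightly sync needed
```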
Admin
I've got two. You choose.
Number 1
I was working as a webdev for an office of a well respected university.
One day a co-worker knocks on my door to tell me that he couldn't pull up our website. I did some checking, and it did seem to be down. I contacted our hosting provider, and they had flipped the switch on our site because we were delinquent on our bill. All we had to do is pay our bill, and the lights would be turned back on.
I asked if they had notified us of this, and they said they had. The only problem was, the contact email address they had on file was the email address of my predecessor. This shop had a very bad habit of not documenting, well, anything. To make matters worse, using our hosting provider's tech support was like trying to navigate through a Rube Goldberg machine.
It probably took a couple hours, but the site eventually came back up and we changed the contact info to something a little more permanent.
Number 2
I was in a Gilbert and Sullivan production in college. One night I come out on stage to sing my solo number, and one of the audience member's primary systems goes down. One of the stage techs walks up on stage to announce someone is having a heart attack and we would have to clear the theater.
The EMTs came and went, then we started again from the top of my song.
Admin
I know that feeling well. Eat a fattening meal, have an alcoholic beverage, spring for dessert, wonder if being a park ranger requires any special education...
Admin
Does anyone remember from the first time this story got posted, the comment involving a production oopsie that involved Bengal tigers showing up in the middle of the ocean?
I would have to say that was one of my all-time favorite production failures.
Admin
Same here, only the DBs are more important than that. However, they're used only at certain times of the year. So in between, I'm free to blow them up and restore from backup.
Admin
I don't know what the hell a Gigaman is, but to have a redundant Gigaman sounds absolutely awesome!
Scene: high-speed train rapidly approaching destroyed bridge. Our Hero steps forward and says: "Don't worry folks, I have a redundant Gigaman!"