| « 2.5: The Beauty of It | PICKing Javascript » |
Keeping hundreds of millions of sheets of paper on file isn't easy, so the IRS had an application built to computerize their records. It'd scan paper tax returns into a WORM (Write Once, Read Many) drive system and record lookup data in a database. That way they could filter by any fields they recorded in the database and access a scanned image of the tax return for any further information using a simple app, which sure beat the old method of data retrieval — digging through boxes, incurring huge wait times.
The nice thing about the old method, though, was that it generally worked. The new system was full of bugs, in addition to several other irritating issues. On Bobby's first day he was put in front of the application, and right off the bat it looked amateurish. Form elements not lined up properly, buttons not always the same size, inconsistent menus — not broken, but certainly not professional looking. His boss, Boris, explained some of the finer features in a dry, humorless, low monotone.
"Now, as you can see heeere," his boss mumbled, "this is pretty exciiiting." Boris would linger on certain syllables presumably in an effort to sound more boring. His tone made Ben Stein sound like Freddie Mercury. "Allll we have to do is click a feeew buttons annnnd..." Meanwhile he was typing some text into the fields to perform a lookup. "And affffter a few short minnnnnutes..."
Bobby was briefly impressed — Boris had made a joke! Not a particularly funny one, granted, but certainly worth a polite chuckle... except- wait a minute. This is actually taking several minutes, Bobby realized. And in Boris's company, each passing minute felt like ten. Boris whiled away the minutes describing some of his hobbies — toy piano tuning, model taxidermy, palindromatic haiku composition — each somehow more boring than the last. After an eternity (nine minutes) of struggling to keep his eyes open and subtle closed-mouth yawning, an error message popped up.
"Ohhh, fidddddlesticks," mumbled Boris in lifeless anger. As Bobby would eventually learn, the database's average response time for tax form searches was in the realm of 8-10 minutes. That was if you were fortunate, though — it'd often just time out.
Although Bobby wanted to help, he wasn't allowed to lay a finger on the database. He was only authorized to edit the frontend code. So to start off he gathered whatever information he could — the hardware (Pyramid Minicomputer running four 150Mhz processors) and the software (AT&T Unix and Informix DB server). He looked through the code and made a few UI improvements, and it didn't take long for him to locate the biggest bottleneck — the database.
After getting in touch with the DBAs, Bobby was granted read-only ODBC access. Bobby kept making tweaks and little optimizations, constantly running SELECT queries, but couldn't get them to run any faster than six minutes. He dug through ODBC settings, hoping to find something... anything that he could change to improve performance, but judging by the crazy, arcane settings already set up, many had tried the same thing and failed.
Bobby had spent several hours and not gotten far from where he started. Wanting something to show for the time spent, he threw together a quick "Please Wait" dialog box with an animated hourglass to give the bored operators something to watch while they waited for the data to come back.
Several days later, Bobby had a stroke of luck. One of the higher-ups had caught wind of his UI changes and wanted a demonstration. When the topic of speed came up, Bobby mentioned that he'd found the bottleneck and suggested he get full access. It shouldn't be a problem since he worked under constant supervision anyway.
"I don't think that'll be an issue," he said with a smile. "Can you do it right now?"
Bobby hesitated, then gave a weak "...sure." He flushed a little red, concerned that he was wrong and he'd embarrass himself in front of everyone. A DBA was called to the room, who opened up a console window and turned Bobby loose.
Of course, it's highly unorthodox for someone to be given absolute access to one of the IRS's production databases, but no one wanted to question the big cheese. The DBA watched Bobby like a hawk.
Bobby swallowed the lump in his throat, tried to ignore the seven sets of eyes on him, and typed some simple queries in an attempt to isolate the problem. In short:
Bobby fired off a few commands to create indexes on fields that he knew were frequently queried. The DBA's stare intensified. Bobby felt more nervous, hoping that it would all work ok. And after verifying that queries were still being profiled, he ran a SELECT query.
It took 0.22 seconds.
Crap, he thought. There's no way it could've run that fast. I must've totally destroyed the production database. Wiping sweat off his forehead, he ran the query again, this time to verify the results. And the results were correct.
With each query he ran his smile grew larger, as did the rest of the room's collective disbelief. The "Please Wait" window would appear and hide so quickly that no one could even see it.
They tried to chart the difference, but what's the point? 8-12 minutes versus 0.35 seconds average? Management did all but break out the pom-poms and do cartwheels throughout the office. The operators were so ecstatic that they actually threw him a party. As for us US taxpayers, we benefited as well since we're getting taxed faster!
Re: Hastening an Inevitable
2008-07-08 10:10
•
by
andrewbadera
|
Define "properly normalized." You do know that normalization is anti-performance, right? |
Re: Hastening an Inevitable
2008-07-08 10:19
•
by
b1xml2
(unregistered)
|
|
when it comes to the magic figure of millions of rows per table, indexing can be the difference between hours and seconds.
but as they say, seeing is believing and when you touch your hands on production grade data in excess of say 2 million rows, then you will know the difference. |
|
Race past rebel base.
Lt. Rat so startles able Bert's ape car. |
|
I would be tempted to type "DROP DATABASE database_name" and just leave it in the query window.
I wouldn't execute it, but just to give the DBA sitting there a heart attack. |
|
The difference indexing makes is pretty simple if you think about it. Indexing basically stores another table ordered by whatever column you indexed. With this, you can do a binary search, which is O(log n), which means searching 1,000,000 records takes only twice the time that searching 1000 records. Without indexing, the table will have to be reordered with every query. As far as I know, most databases use mergesort, which is O(n log n), then the binary search O(log n).
so, with 2 Million records in the table: indexed lg(2,000,000) ~ 21 instructions non-indexed 2,000,000 lg(2,000,000) + lg(2,000,000) = ~42,000,021 instructions |
Re: Hastening an Inevitable
2008-07-08 11:26
•
by
Mike Dimmick
(unregistered)
|
No, they use a simple linear search which is O(n). The engine will only sort if a sort has been specified in an ORDER BY clause. It may sort to do a join, or it may use hashing. Far more important than the number of CPU instructions is the amount of disk I/O. An operation on a database is rarely CPU-bound - only if the data will fit in the processor cache. It may be memory-bandwidth-bound if the dataset fits in main memory and the data is already cached. Otherwise you're in the (relatively) glacial arena of disk. Remember that even a 150MHz processor could manage to execute 42m instructions in well under a second if all the instructions and data fit in its caches! Even the Pentium was superscalar and pipelined, and able to execute simple arithmetic and logical instructions at more than one per clock cycle. Programmers need to understand the hierarchy of bandwidths and latencies in a modern system. Watch Herb Sutter's presentation (will make most sense to C++ programmers, but managed/Java/scripting programmers should also take note). |
This is the government. To be a DBA you have to, (a) have a computer science degree, and (b) be very skilled at filling out the paperwork. Desire or ability to build quality computer systems is a serious handicap to anyone working in a government IT job. Those kind of people tend to not be willing to spend their entire day filling out paperwork describing what the system would do if we ever got around to actually building it, how it would interact with all the other systems that will never actually be built, and how it would comply with all the applicable regulations written by bureaucrats who have no idea how to build computer systems. Those people always want to rush off and really build something. They cling to naive ideas that one could design a computer system with less than 10 pages of documentation per line of code, or that there should be more programmers than managers on a project. I was there for 13 years. |
| « 2.5: The Beauty of It | PICKing Javascript » |