Admin
I have a database with 800 MB of data and around 6 million rows, and "a few minutes" for a schema change is 30+. The indexes take 600 MB, and the first query takes a full minute (while it loads the index).
So much for relational speed.
Admin
You're right, I thought about this afterwards. A linear search would be faster, so I have to assume that's what would be used. It's still ~2 million comparisons vs. ~21 (binary search is still much faster if the list is sorted). I also used the word "instructions" loosely. Obviously if all the information is in cache, or even RAM, 42 million really isn't so bad. I was talking about instructions in the big-O-notation sense.
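To put rough numbers on it, here's a quick back-of-the-envelope sketch in Python (purely illustrative, nothing to do with any real schema): a linear scan over 2 million sorted values versus a binary search over the same list.

    import math
    from bisect import bisect_left

    data = list(range(2_000_000))          # already sorted
    target = 1_999_999                     # worst case for a linear scan

    linear_comparisons = data.index(target) + 1           # ~2,000,000 element checks
    binary_comparisons = math.ceil(math.log2(len(data)))  # ~21 element checks

    found_at = bisect_left(data, target)   # the binary search itself
    print(linear_comparisons, binary_comparisons, data[found_at] == target)

Whether those comparisons hit cache, RAM, or disk changes the constant factor, not the shape of the curve.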
Admin
You might want to refine that to "You know that normalization is anti-performance for Non-Aggregate SELECT statements, provided the Logical-Physical Confusion persists on the RDBMS platform of your choice."
Admin
A brilliant and breathtaking tale. Until the end I kept wondering what would happen! O_O;
Admin
Normalization is anti-performance only when there is a JOIN involved. In other situations it can actually IMPROVE performance, because you have a smaller set of data.
Admin
No, good normalization does not degrade performance. Suppose a table has 2 million rows & two FOREIGN KEYs. Both speed & space can be improved.
One key points to a VARCHAR(200), which is stored only once. The referencing table holds only a 4-byte integer instead of up to 200 bytes, roughly 200/4 = 50 times less space.
Users often search on both these FOREIGN KEYs. The database engine can much more easily index integer keys than VARCHAR(200) strings or most other datatypes. Also, some databases automatically create indexes on PRIMARY or FOREIGN KEYs. The database performs faster due to this normalization.
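As a minimal sketch of what that looks like (table and column names are made up, and sqlite3 stands in for whatever RDBMS you actually run):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Lookup table: the VARCHAR(200) value is stored exactly once.
    cur.execute("""CREATE TABLE category (
                       category_id INTEGER PRIMARY KEY,
                       name        VARCHAR(200) UNIQUE NOT NULL)""")

    # Fact table: each row carries only a small integer foreign key.
    cur.execute("""CREATE TABLE filing (
                       filing_id   INTEGER PRIMARY KEY,
                       category_id INTEGER NOT NULL REFERENCES category(category_id))""")

    # Indexing the integer key keeps the index small and cheap to traverse.
    cur.execute("CREATE INDEX idx_filing_category ON filing(category_id)")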
Admin
I know that normalization and performance are two distinct, although moderately correlated, dimensions for evaluating database design.
I also know that making blanket statements in the squishy parts of CS is a recipe for fail.
If you were a "real DBA" you would know that modeling the problem correctly and normalization aren't always the same thing. Given the same blob of data, a database built for management/integrity is going to look very different than one designed for mining.
Admin
Interesting...
I thought that an index on a column basically created a hash table, mapping each value in that column to its corresponding row in the table. This would make inner joins on indexed columns a linear-time operation in the total number of rows in the smaller table.
Your approach, however, is better suited if a column doesn't have a unique constraint, or for running queries with a less-than / greater-than operator on that column.
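For what it's worth, the hash-index idea is essentially a hash join, which looks roughly like this in plain Python (toy data, not any real schema): build a dict on the smaller table, then probe it once per row of the larger one.

    # Hash join sketch: roughly O(N + M), assuming cheap dict lookups.
    small = [(1, "alpha"), (2, "beta"), (3, "gamma")]      # (id, label)
    large = [(10, 1), (11, 3), (12, 2), (13, 1)]           # (row_id, fk_id)

    lookup = {key: label for key, label in small}          # build phase
    joined = [(row_id, lookup[fk]) for row_id, fk in large # probe phase
              if fk in lookup]
    print(joined)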
Admin
Normalization makes no sense unless a JOIN is involved! Do you know what normalization means?
Look up the several kinds of JOINs in SQL. INNER JOIN acts like a WHERE clause, and it's the one most people understand. The outer joins, in LEFT JOIN & RIGHT JOIN syntax, are also important.
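If it helps anyone following along, here's the difference in miniature (hypothetical tables, sqlite3 used purely for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE taxpayer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE refund   (taxpayer_id INTEGER, amount REAL);
        INSERT INTO taxpayer VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO refund   VALUES (1, 250.0);
    """)

    # INNER JOIN: only taxpayers that actually have a matching refund row.
    print(cur.execute("""SELECT t.name, r.amount
                         FROM taxpayer t
                         INNER JOIN refund r ON r.taxpayer_id = t.id""").fetchall())

    # LEFT JOIN: every taxpayer, with NULL where no refund exists.
    print(cur.execute("""SELECT t.name, r.amount
                         FROM taxpayer t
                         LEFT JOIN refund r ON r.taxpayer_id = t.id""").fetchall())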
Admin
This story reminds me of a problem I had to solve once. Seems there was this billing process that took an awfully long time to run. In fact, as the DB grew, the time to process had slowly grown to over 24 hours, which simply couldn't be accepted since the billing cycle was 24 hours.
Didn't turn out to be an indexing problem, though. Seems there was this query that returned all the rows sorted by a Time field, just so that the program could use the first row and ignore all the others. Already tremendously bad, given the size of the DB. But this query was run ONCE PER ROW. If the table had a million rows, you'd read in the entire million rows a million times!! A simple change to the query to use MAX() and the process was reduced from 24+ hours to less than 24 seconds.
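The shape of the fix was roughly this (table and column names invented for the example, sqlite3 standing in for the real database):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE billing_event (id INTEGER PRIMARY KEY, event_time TEXT)")
    cur.executemany("INSERT INTO billing_event (event_time) VALUES (?)",
                    [(f"2024-01-01T00:00:{i:02d}",) for i in range(60)])

    # The slow pattern: sort the whole table, then throw away all but one row.
    latest_slow = cur.execute(
        "SELECT event_time FROM billing_event ORDER BY event_time DESC").fetchall()[0]

    # The fix: let the database compute just the maximum and return one row.
    latest_fast = cur.execute(
        "SELECT MAX(event_time) FROM billing_event").fetchone()

    print(latest_slow, latest_fast)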
I didn't get a party, though. Turns out they could barely acknowledge what I'd done, because the entire process had been coded by a "golden boy" who received accolades, plaques and bonuses on a weekly basis. Guess it's easy to look like a hero if what you first produce is pure crap, but does manage to get the job done. A little praise up front, and lots more when you improve the system by fixing the numerous bugs you initially put into production. Combine that with the ability to BS, and you get "golden boy" syndrome.
Admin
There are several ways to index data, based on what kind of searches you want to do. Learn more from the open source databases.
See the Spatial Index feature in MySQL. It uses the R-Tree (Rectangular bounding-box Tree) to limit 2D searches to a smaller area.
See the Index Types in PostgreSQL. Notably, PostgreSQL uses GiST indexing for spatial queries, since its R-Tree implementation wasn't good.
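SQLite ships a similar facility in its rtree module, which makes for a compact illustration of the bounding-box idea (this is just a sketch, and it assumes your SQLite build has the rtree extension compiled in):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # R*Tree virtual table: each row stores a 2D bounding box.
    # (Requires an SQLite build with the rtree extension enabled.)
    cur.execute("""CREATE VIRTUAL TABLE parcel_index
                   USING rtree(id, min_x, max_x, min_y, max_y)""")
    cur.execute("INSERT INTO parcel_index VALUES (1, 0.0, 1.0, 0.0, 1.0)")
    cur.execute("INSERT INTO parcel_index VALUES (2, 5.0, 6.0, 5.0, 6.0)")

    # Bounding-box query: only boxes overlapping the search window are considered.
    rows = cur.execute("""SELECT id FROM parcel_index
                          WHERE max_x >= 0.5 AND min_x <= 0.7
                            AND max_y >= 0.5 AND min_y <= 0.7""").fetchall()
    print(rows)   # -> [(1,)]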
Admin
I had this problem some time ago. I imported some tables from Excel using a little utility and normalized the database, but somehow I forgot the keys. I had this same problem: queries took too much time. Then I remembered the keys, and I got a similar performance boost.
Admin
... aaaaaaaaaaaaaaaaaaaaaaand THAT'S why we can't have nice things.
Twerp.
Admin
Yes, but would they be fazed? (And "DBAs" does not require an apostrophe.)
Admin
In all this talk about normalization, how is it no one's brought up 3rd normal form? Of course, even a perfect 3rd-normal-form database isn't necessarily going to give you peak performance. But generally, a sensibly normalized database combined with the right indexes will make a huge difference.
Having said that, I've run into a case or two where denormalizing was the answer.
With regard to foreign/primary keys, I know that Oracle enforces their uniqueness by (drum roll...) creating indexes, and I suspect most other databases do the same.
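You can watch a database do this with a toy example. In SQLite, for instance, a UNIQUE constraint gets an automatically created index, and the planner uses it for lookups (made-up table, illustration only):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE account (
                       account_id INTEGER PRIMARY KEY,
                       tax_id     TEXT UNIQUE)""")

    # The uniqueness constraint is enforced by an index the engine created itself.
    print(cur.execute("PRAGMA index_list('account')").fetchall())

    # And the planner uses that index for lookups instead of scanning the table.
    print(cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM account WHERE tax_id = ?",
        ("123-45-6789",)).fetchall())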
Admin
The best part about this story is that it brought all the old-school technical people out. Lately, there have been far too many business-related WTFs where all the comments have been clichés and stale memes. It's refreshing to see a heated discussion about programming and computer science again!
Admin
Try ~175 billion rows in a vector db (nine years of stock market records).
Admin
I've run into that too. But...
Every single time we've ever had to denormalize something for speed, eventually the braindead peons get ahold of it, add a few dozen columns because they're just following what someone else did (rather than thinking about the problem critically), and completely ruin the speed benefits.
So eventually we end up renormalizing it in the end anyway, and the code ends up being a horrible mess for having changed back and forth.
It's not worth it. Unless you have an awesome DBA who will always be there and is a complete dictator, some retard is going to come along and ruin the model if you don't keep things normalized properly. Sad but true.
Admin
A contract programmer I worked with once regaled me with a similar story. He'd been called in to fix a "slow cron job" on a PHP system that was taking an hour and a half to run. Trouble was, it was run once an hour.
Adding indexes to tables, an act that took him less than 10 minutes, reduced the cron job's run time from an hour and a half to 6 minutes. That's a 15x speedup.
The client he was working for (who had written this whole system himself) was astounded.
Admin
This is the government. To be a DBA you have to, (a) have a computer science degree, and (b) be very skilled at filling out the paperwork.
Desire or ability to build quality computer systems is a serious handicap to anyone working in a government IT job. Those kinds of people tend not to be willing to spend their entire day filling out paperwork describing what the system would do if we ever got around to actually building it, how it would interact with all the other systems that will never actually be built, and how it would comply with all the applicable regulations written by bureaucrats who have no idea how to build computer systems. Those people always want to rush off and really build something. They cling to naive ideas that one could design a computer system with fewer than 10 pages of documentation per line of code, or that there should be more programmers than managers on a project.
I was there for 13 years.
Admin
Storing the varchar in the first table, adding a serial integer primary key on that table, and using that id on the other two related tables was a huge performance boost.
A poorly designed, unnormalized schema can easily be far less performant than a normalized one. To denormalize, you must first understand normalization. If you, like those in this story, don't even understand the necessity of indexing, then you're probably not going to successfully come up with a performant denormalized schema.
Admin
But if the DBA didn't know about indexes and primary keys, what makes you think he'd know about a DROP TABLE command?
Admin
I don't, and I have a hard time believing this. Can you explain?
Rob
Admin
Excel spreadsheets are even faster than flat files.
Admin
Thanks, jacka$$! :)
Admin
Not always
Admin
You're a jerk!!!! The real WTF is helping the IRS.
Admin
The Chief Architect for Digg would beg to differ with you. In a podcast interview ("Deep Fried Bits") he stated that they don't normalize every chance they get, because all those JOINs are too much of a performance hit. There IS such a thing as over-normalizing. Even a DB 101 textbook will tell you that.
Both of you: stay away from absolutist statements. Please.
Admin
In case it hasn't already been pointed out to everyone's satisfaction, normalization is most valuable when records are frequently being added to and removed from the database.
If the database is primarily for read-only, reporting or data mining, it is often more efficient to carefully denormalize and even sometimes aggregate the data (although aggregation wouldn't make sense in this case because this system is looking up individual tax returns).
In a nutshell: lots of transactions: normalize. Lots of lookups with few transactions: denormalize.
Admin
"I'm from the government. I'm here to help you - very, very slowly"!
Admin
I'm surprised no one has raised the data integrity issues. A relational database with no keys--and therefore no unique key constraints--is sure to contain duplicate data. We can probably conclude no foreign key constraints were in place, either (it would be bizarre to encounter a total lack of indexing alongside scrupulous declarative R/I); therefore, a lot of orphaned data is also virtually certain.
Our hero made retrieving results far faster; unfortunately, a huge effort still remained to improve the quality of those results.
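For anyone who hasn't seen what declarative R/I buys you, here's a tiny sketch (invented tables, sqlite3 for illustration; note that SQLite won't even enforce foreign keys unless you ask):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves enforcement off by default
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE taxpayer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tax_return (
            id          INTEGER PRIMARY KEY,
            taxpayer_id INTEGER NOT NULL REFERENCES taxpayer(id),
            year        INTEGER NOT NULL,
            UNIQUE (taxpayer_id, year)          -- no duplicate returns per year
        );
        INSERT INTO taxpayer VALUES (1, 'Ann Example');
    """)

    # A return pointing at a taxpayer that doesn't exist is rejected outright.
    try:
        cur.execute("INSERT INTO tax_return (taxpayer_id, year) VALUES (999, 2007)")
    except sqlite3.IntegrityError as e:
        print("orphan rejected:", e)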
Admin
Normalization is a great thing first and foremost. It serves to reduce null data, save space, and, in my opinion most importantly, guarantee data integrity. A well-designed database does not allow data to be entered that violates the data model. That being said, in doing this things often have to be separated into multiple tables, and when you do that you're forcing the query writer to use either a join or a subquery (hopefully a join).
Now, a well-defined database speaks for itself; it represents the data model perfectly, and it's a beautiful thing. A poorly designed database is job security... because it makes knowledge transfer a pain in the ass, if not impossible. It also makes the database useless beyond a certain scope: the uses become too great for only a few people to know, so the system is limited and productivity suffers.
Now, notice I said designed, not highly normalized. Personally I like to normalize all of my tables to 5th normal form. Of course that is a very academic standard, and I don't always get 5th normal form. In some cases I have to go lower, and when I do I have a specific reason for doing so, i.e. the data model doesn't lend itself to 5th normal form, or it's for maintenance purposes. All that really matters is that the data's integrity is upheld by the database.
In this day and age we have some pretty sweet computers, and believe you me, if you feel the need to denormalize a transactional database for "performance", you're doing something wrong and your name belongs in one of these here little stories. Of course, remember I said transactional... a data warehouse is denormalized and can still be well designed, but a DW serves a very different purpose than a transactional database.
Admin
Nah, we're getting taxed at the same speed, but when we need to request a copy of our old returns, it takes a few days instead of a few weeks. And our tax dollars are paying fewer employees at the IRS, since they don't need people warming the seats while the queries are made.
Oh wait, it's the IRS. Never mind.
Admin
You performed two operations here, only one of which was normalization.
Step 1. Normalization - replace primary keys in tables 2 and 3 with foreign keys to table 1.
Step 2. Optimization - introduce a surrogate key for table 1, and update incoming foreign keys.
In practice you combined these two steps into one operation, but it isn't fair to attribute the performance gains of introducing the surrogate key to normalization.
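A sketch of the two steps on a toy schema (all names invented, sqlite3 standing in for the real engine), so it's clear where the surrogate key enters:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Before: the referencing table repeats the long natural key from table 1.
    cur.executescript("""
        CREATE TABLE employer_old (name VARCHAR(200) PRIMARY KEY, city TEXT);
        CREATE TABLE w2_old       (form_id INTEGER PRIMARY KEY, employer_name VARCHAR(200));
        INSERT INTO employer_old VALUES ('Initech LLC', 'Austin');
        INSERT INTO w2_old       VALUES (1, 'Initech LLC');
    """)

    # Step 2 on its own: introduce a surrogate integer key and rewrite the incoming reference.
    cur.executescript("""
        CREATE TABLE employer (employer_id INTEGER PRIMARY KEY,
                               name VARCHAR(200) UNIQUE NOT NULL,
                               city TEXT);
        INSERT INTO employer (name, city) SELECT name, city FROM employer_old;

        CREATE TABLE w2 (form_id INTEGER PRIMARY KEY,
                         employer_id INTEGER NOT NULL REFERENCES employer(employer_id));
        INSERT INTO w2 (form_id, employer_id)
            SELECT o.form_id, e.employer_id
            FROM w2_old o JOIN employer e ON e.name = o.employer_name;
    """)
    print(cur.execute("SELECT * FROM w2").fetchall())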
Admin
Exactly. I have a MySQL database that I threw together quite quickly and without knowing too much about MySQL optimization (and the optimizations I did make were not for fast single-row lookups). It also takes me under a second to get a record from a 70+ million row table on the first search.
Admin
But that's what memcached and reverse proxies are for, stupid! Why should the Rails developers have to think when they can just add another server... Databases are just a necessary evil. Just use XML and be done with it (or YAML, of course).
Admin
One story at a time.
Admin
Ken, no-one really cares about your db. Let's not get into a "my database is faster than yours" argument.
Admin
Nice one. I wonder what the qualifications of the DBAs were? And I'm also wondering how long it took Bobby to index those tables!! It would take quite a while. Fabulous nonetheless... hope he got a pay hike and the DBAs' job.
Admin
Peter: Saying that "normalization is anti-performance" is an over-simplification--managing redundant data also kills performance, and normalization can reduce that. On the other hand, pursuing normalization for its own sake can certainly reduce performance eventually.
A general rule of thumb that I hear and see practiced in the industry is to design for 3rd Normal Form, then "turn the dial back a quarter turn" to roughly 2.75 Normal Form (meaning denormalize just a little in the handful of places where performance is measurably subpar).
Of course, in this case, we have to wonder whether there was any normalization or even design worthy of the title.
Admin
Exactly. Making a blanket assumption that normalized databases are anti-performance is a sign of noob-level understanding. It all comes down to where you want the performance: reading, writing, aggregating, etc. If you are looking for quick reporting, then by all means de-normalize the structure. But, otherwise, flat database structures become unwieldy and full of redundant data. This is why a lot of large organizations -- namely, banks, insurance agencies, government agencies, etc. -- maintain a normalized "storage DB" and a de-normalized reporting/read-only DB: usually, you want performance on both ends of the spectrum.
Admin
No, this is not normalisation in any way, shape or form.
No, in the non-degenerate case, a database will create indices with equal ease, be the key an integer or a VARCHAR(200).
I believe the term you are searching for is "caching." Since this applies equally as well to in-memory data structures, most mere programmers are just as comfortable with the concept as the mighty DBA.
Admin
This is the IRS: I'd guess that at least one person in that room might have had a gun. If I'd had one, I'd have said, "STOP! Touch one more key and I'll kill you."
But then, I've always wanted to say that to somebody. Perhaps that would have been an overreaction.
M
Admin
The first is truly an administrator. He knows how to bring the DB up and down, expand tablespaces, get the most recent backup tape off-site, etc. He's great for a stable (aka legacy) system.
But if you're making active changes, as in a system that's still under development, you want the 2nd type. Unfortunately, they're far rarer. If you find one, do your best to keep him.
Admin
F: Performance is not improved by simply adding indexes; it comes from adding the "right" indexes. That is supposed to be one of the hallmarks of a decent DBA. Adding indexes actually reduces performance on data writes (insert, update, delete) while (hopefully) improving performance on data reads (select).
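A crude way to see both sides of that trade-off (illustrative only; table names made up, and the actual numbers will vary wildly by engine and hardware):

    import sqlite3, time

    def timed(label, fn):
        t0 = time.perf_counter()
        result = fn()
        print(f"{label}: {time.perf_counter() - t0:.3f}s")
        return result

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, account INTEGER, amount REAL)")
    rows = [(i % 5000, float(i)) for i in range(200_000)]

    # Writes pay for index maintenance...
    timed("insert, no index", lambda: cur.executemany(
        "INSERT INTO ledger (account, amount) VALUES (?, ?)", rows))
    cur.execute("CREATE INDEX idx_ledger_account ON ledger(account)")
    timed("insert, with index", lambda: cur.executemany(
        "INSERT INTO ledger (account, amount) VALUES (?, ?)", rows))

    # ...but reads on the indexed column no longer scan the whole table.
    timed("indexed lookup", lambda: cur.execute(
        "SELECT SUM(amount) FROM ledger WHERE account = 42").fetchone())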
Admin
We may not be Facebook or Twitter, but not many people are, and I always get a good laugh out of developers who casually dismiss scale and performance issues in favour of the latest programming fad.
Oh, and to add to the current tangential debate - normalization can be a curse or a blessing depending on what it is that you're trying to do. There are some things we used to have normalized that have been denormalized due to the performance costs; however, I've seen plenty of other databases that made me wonder if the designers had even heard of 3NF.
Admin
Awesome comment, Hansel.