Admin
I have a database with 800 MB of data and around 6 million rows, and "a few minutes" for a schema change is 30+. The indexes take 600 MB, and the first query takes a full minute (while it loads the index).
So much for relational speed.
Admin
You're right, I thought about this afterwards. A linear search would be faster, so I have to assume that's what would be used. It's still ~2 million comparisons vs. ~21 (binary search is still much faster if the list is sorted). I also used the word "instructions" loosely. Obviously if all the information is in cache, or even RAM, 42 million really isn't so bad. I was talking about instructions in the big-O-notation sense.
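To put rough numbers on it, here's a quick back-of-the-envelope sketch in Python (purely illustrative, nothing to do with any real schema): a linear scan over 2 million sorted values versus a binary search over the same list.

    import math
    from bisect import bisect_left

    data = list(range(2_000_000))          # already sorted
    target = 1_999_999                     # worst case for a linear scan

    linear_comparisons = data.index(target) + 1           # ~2,000,000 element checks
    binary_comparisons = math.ceil(math.log2(len(data)))  # ~21 element checks

    found_at = bisect_left(data, target)   # the binary search itself
    print(linear_comparisons, binary_comparisons, data[found_at] == target)

Whether those comparisons hit cache, RAM, or disk changes the constant factor, not the shape of the curve.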
Admin
You might want to refine that to "You know that normalization is anti-performance for Non-Aggregate SELECT statements, provided the Logical-Physical Confusion persists on the RDBMS platform of your choice."
Admin
A brilliant and breathtaking tale. Until the end I kept wondering what would happen! O_O;
Admin
Normalization is anti-performance only when there is a JOIN involved. In other situations it can actually IMPROVE performance, because you have a smaller set of data.
Admin
No, good normalization does not degrade performance. Suppose a table has 2 million rows & two FOREIGN KEYs. Both speed & space can be improved.
One key points to a VARCHAR(200), which is stored only once. The referencing table holds only a 4-byte integer instead of up to 200 bytes, roughly 200/4 = 50 times less space.
Users often search on both these FOREIGN KEYs. The database engine can much more easily index integer keys than VARCHAR(200) strings or most other datatypes. Also, some databases automatically create indexes on PRIMARY or FOREIGN KEYs. The database performs faster due to this normalization.
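As a minimal sketch of what that looks like (table and column names are made up, and sqlite3 stands in for whatever RDBMS you actually run):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Lookup table: the VARCHAR(200) value is stored exactly once.
    cur.execute("""CREATE TABLE category (
                       category_id INTEGER PRIMARY KEY,
                       name        VARCHAR(200) UNIQUE NOT NULL)""")

    # Fact table: each row carries only a small integer foreign key.
    cur.execute("""CREATE TABLE filing (
                       filing_id   INTEGER PRIMARY KEY,
                       category_id INTEGER NOT NULL REFERENCES category(category_id))""")

    # Indexing the integer key keeps the index small and cheap to traverse.
    cur.execute("CREATE INDEX idx_filing_category ON filing(category_id)")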
Admin
I know that normalization and performance are two distinct, although moderately correlated, dimensions for evaluating database design.
I also know that making blanket statements in the squishy parts of CS is a recipe for fail.
If you were a "real DBA" you would know that modeling the problem correctly and normalization aren't always the same thing. Given the same blob of data, a database built for management/integrity is going to look very different than one designed for mining.
Admin
Interesting...
I thought that an index on a column basically created a hash table, mapping each value in that column to its corresponding row in the table. This would make inner joins on indexed columns a linear-time operation in the total number of rows in the smaller table.
Your approach, however, is better suited if a column doesn't have a unique constraint, or for running queries with a less-than / greater-than operator on that column.
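For what it's worth, the hash-index idea is essentially a hash join, which looks roughly like this in plain Python (toy data, not any real schema): build a dict on the smaller table, then probe it once per row of the larger one.

    # Hash join sketch: roughly O(N + M), assuming cheap dict lookups.
    small = [(1, "alpha"), (2, "beta"), (3, "gamma")]      # (id, label)
    large = [(10, 1), (11, 3), (12, 2), (13, 1)]           # (row_id, fk_id)

    lookup = {key: label for key, label in small}          # build phase
    joined = [(row_id, lookup[fk]) for row_id, fk in large # probe phase
              if fk in lookup]
    print(joined)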
Admin
Normalization makes no sense unless a JOIN is involved! Do you know what normalization means?
Look up the several kinds of JOINs in SQL. INNER JOIN acts like a WHERE clause, and it's the one most people understand. The outer joins, in LEFT JOIN & RIGHT JOIN syntax, are also important.
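If it helps anyone following along, here's the difference in miniature (hypothetical tables, sqlite3 used purely for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE taxpayer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE refund   (taxpayer_id INTEGER, amount REAL);
        INSERT INTO taxpayer VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO refund   VALUES (1, 250.0);
    """)

    # INNER JOIN: only taxpayers that actually have a matching refund row.
    print(cur.execute("""SELECT t.name, r.amount
                         FROM taxpayer t
                         INNER JOIN refund r ON r.taxpayer_id = t.id""").fetchall())

    # LEFT JOIN: every taxpayer, with NULL where no refund exists.
    print(cur.execute("""SELECT t.name, r.amount
                         FROM taxpayer t
                         LEFT JOIN refund r ON r.taxpayer_id = t.id""").fetchall())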
Admin
This story reminds me of a problem I had to solve once. Seems there was this billing process that took an awfully long time to run. In fact, as the DB grew, the time to process had slowly grown to over 24 hours, which simply couldn't be accepted since the billing cycle was 24 hours.
Didn't turn out to be an indexing problem, though. Seems there was this query that returned all the rows sorted by a Time field, just so that the program could use the first row and ignore all the others. Already tremendously bad, given the size of the DB. But this query was run ONCE PER ROW. If the table had a million rows, you'd read in the entire million rows a million times!! A simple change to the query to use MAX() and the process was reduced from 24+ hours to less than 24 seconds.
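The shape of the fix was roughly this (table and column names invented for the example, sqlite3 standing in for the real database):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE billing_event (id INTEGER PRIMARY KEY, event_time TEXT)")
    cur.executemany("INSERT INTO billing_event (event_time) VALUES (?)",
                    [(f"2024-01-01T00:00:{i:02d}",) for i in range(60)])

    # The slow pattern: sort the whole table, then throw away all but one row.
    latest_slow = cur.execute(
        "SELECT event_time FROM billing_event ORDER BY event_time DESC").fetchall()[0]

    # The fix: let the database compute just the maximum and return one row.
    latest_fast = cur.execute(
        "SELECT MAX(event_time) FROM billing_event").fetchone()

    print(latest_slow, latest_fast)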
I didn't get a party, though. Turns out they could barely acknowledge what I'd done, because the entire process had been coded by a "golden boy" who received accolades, plaques and bonuses on a weekly basis. Guess it's easy to look like a hero if what you first produce is pure crap, but does manage to get the job done. A little praise up front, and lots more when you improve the system by fixing the numerous bugs you initially put into production. Combine that with the ability to BS, and you get "golden boy" syndrome.
Admin
There are several ways to index data, based on what kind of searches you want to do. Learn more from the open source databases.
See the Spatial Index feature in MySQL. It uses the R-Tree (Rectangular bounding-box Tree) to limit 2D searches to a smaller area.
See the Index Types in PostgreSQL. Notably, PostgreSQL uses GiST indexing for spatial queries, since its R-Tree implementation wasn't good.
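SQLite ships a similar facility in its rtree module, which makes for a compact illustration of the bounding-box idea (this is just a sketch, and it assumes your SQLite build has the rtree extension compiled in):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # R*Tree virtual table: each row stores a 2D bounding box.
    # (Requires an SQLite build with the rtree extension enabled.)
    cur.execute("""CREATE VIRTUAL TABLE parcel_index
                   USING rtree(id, min_x, max_x, min_y, max_y)""")
    cur.execute("INSERT INTO parcel_index VALUES (1, 0.0, 1.0, 0.0, 1.0)")
    cur.execute("INSERT INTO parcel_index VALUES (2, 5.0, 6.0, 5.0, 6.0)")

    # Bounding-box query: only boxes overlapping the search window are considered.
    rows = cur.execute("""SELECT id FROM parcel_index
                          WHERE max_x >= 0.5 AND min_x <= 0.7
                            AND max_y >= 0.5 AND min_y <= 0.7""").fetchall()
    print(rows)   # -> [(1,)]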
Admin
I had this problem some time ago. I imported some tables from Excel using a little utility and normalized the database, but somehow I forgot the keys. I had this same problem: queries took too much time. Then I remembered the keys, and I got a similar performance boost.
Admin
... aaaaaaaaaaaaaaaaaaaaaaand THAT'S why we can't have nice things.
Twerp.
Admin
Yes, but would they be fazed? (And "DBAs" does not require an apostrophe.)
Admin
In all this talk about normalization, how is it no one's brought up 3rd normal form? Of course, even a perfect 3rd-normal-form database isn't necessarily going to give you peak performance. But generally, a sensibly normalized database combined with the right indexes will make a huge difference.
Having said that, I've run into a case or two where denormalizing was the answer.
With regard to foreign/primary keys, I know that Oracle enforces their uniqueness by (drum roll...) creating indexes, and I suspect most other databases do the same.
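You can watch a database do this with a toy example. In SQLite, for instance, a UNIQUE constraint gets an automatically created index, and the planner uses it for lookups (made-up table, illustration only):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE account (
                       account_id INTEGER PRIMARY KEY,
                       tax_id     TEXT UNIQUE)""")

    # The uniqueness constraint is enforced by an index the engine created itself.
    print(cur.execute("PRAGMA index_list('account')").fetchall())

    # And the planner uses that index for lookups instead of scanning the table.
    print(cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM account WHERE tax_id = ?",
        ("123-45-6789",)).fetchall())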
Admin
The best part about this story is that it brought all the old-school technical people out. Lately, there have been far too many business-related WTFs where all the comments have been clichés and stale memes. It's refreshing to see a heated discussion about programming and computer science again!
Admin
Try ~175 billion rows in a vector db (nine years of stock market records).
Admin
I've run into that too. But...
Every single time we've ever had to denormalize something for speed, eventually the braindead peons get ahold of it, add a few dozen columns because they're just following what someone else did (rather than thinking about the problem critically), and completely ruin the speed benefits.
So eventually we end up renormalizing it in the end anyway, and the code ends up being a horrible mess for having changed back and forth.
It's not worth it. Unless you have an awesome DBA who will always be there and is a complete dictator, some retard is going to come along and ruin the model if you don't keep things normalized properly. Sad but true.
Admin
A contract programmer I worked with once regaled me with a similar story. He'd been called in to fix a "slow cron job" on a PHP system that was taking an hour and a half to run. Trouble was, it was run once an hour.
Adding indexes to tables, an act that took him less than 10 minutes, reduced the cron job's run time from an hour and a half to 6 minutes. That's a 15x speedup.
The client he was working for (who had written this whole system himself) was astounded.
Admin
This is the government. To be a DBA you have to, (a) have a computer science degree, and (b) be very skilled at filling out the paperwork.
Desire or ability to build quality computer systems is a serious handicap to anyone working in a government IT job. Those kinds of people tend not to be willing to spend their entire day filling out paperwork describing what the system would do if we ever got around to actually building it, how it would interact with all the other systems that will never actually be built, and how it would comply with all the applicable regulations written by bureaucrats who have no idea how to build computer systems. Those people always want to rush off and really build something. They cling to naive ideas that one could design a computer system with fewer than 10 pages of documentation per line of code, or that there should be more programmers than managers on a project.
I was there for 13 years.
Admin
Storing the varchar in the first table, adding a serial integer primary key on that table, and using that id on the other two related tables was a huge performance boost.
A poorly designed, unnormalized schema can easily be far less performant than a normalized one. To denormalize, you must first understand normalization. If you, like those in this story, don't even understand the necessity of indexing, then you're probably not going to successfully come up with a performant denormalized schema.
Admin
But if the DBA didn't know about indexes and primary keys, what makes you think he'd know about a DROP TABLE command?
Admin
I don't, and I have a hard time believing this. Can you explain?
Rob
Admin
Excel spreadsheets are even faster than flat files.
Admin
Thanks, jacka$$! :)
Admin
Not always
Admin
You're a jerk!!!! The real WTF is helping the IRS.
Admin
The Chief Architect for Digg would beg to differ with you. In a podcast interview ("Deep Fried Bits") he stated that they don't normalize every chance they get, because all those JOINs are too much of a performance hit. There IS such a thing as over-normalizing. Even a DB 101 textbook will tell you that.
Both of you: stay away from absolutist statements. Please.
Admin
In case it hasn't already been pointed out to everyone's satisfaction, normalization is most valuable when records are frequently being added to and removed from the database.
If the database is primarily for read-only, reporting or data mining, it is often more efficient to carefully denormalize and even sometimes aggregate the data (although aggregation wouldn't make sense in this case because this system is looking up individual tax returns).
In a nutshell: lots of transactions: normalize. Lots of lookups with few transactions: denormalize.
Admin
"I'm from the government. I'm here to help you - very, very slowly"!
Admin
I'm surprised no one has raised the data integrity issues. A relational database with no keys--and therefore no unique key constraints--is sure to contain duplicate data. We can probably conclude no foreign key constraints were in place, either (it would be bizarre to encounter a total lack of indexing alongside scrupulous declarative R/I); therefore, a lot of orphaned data is also virtually certain.
Our hero made retrieving results far faster; unfortunately, a huge effort still remained to improve the quality of those results.
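For anyone who hasn't seen what declarative R/I buys you, here's a tiny sketch (invented tables, sqlite3 for illustration; note that SQLite won't even enforce foreign keys unless you ask):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves enforcement off by default
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE taxpayer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tax_return (
            id          INTEGER PRIMARY KEY,
            taxpayer_id INTEGER NOT NULL REFERENCES taxpayer(id),
            year        INTEGER NOT NULL,
            UNIQUE (taxpayer_id, year)          -- no duplicate returns per year
        );
        INSERT INTO taxpayer VALUES (1, 'Ann Example');
    """)

    # A return pointing at a taxpayer that doesn't exist is rejected outright.
    try:
        cur.execute("INSERT INTO tax_return (taxpayer_id, year) VALUES (999, 2007)")
    except sqlite3.IntegrityError as e:
        print("orphan rejected:", e)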
Admin
Normalization is a great thing first and foremost. It serves to reduce null data, save space, and, in my opinion most importantly, guarantee data integrity. A well-designed database does not allow data to be entered that violates the data model. That being said, in doing this things often have to be separated into multiple tables, and when you do that you're forcing the query writer to use either a join or a subquery (hopefully a join).
Now, a well-defined database speaks for itself; it represents the data model perfectly, and it's a beautiful thing. A poorly designed database is job security... because it makes knowledge transfer a pain in the ass, if not impossible. It also makes the database useless beyond a certain scope: the uses become too great for only a few people to know, so the system is limited and productivity suffers.
Now, notice I said designed, not highly normalized. Personally I like to normalize all of my tables to 5th normal form. Of course that is a very academic standard, and I don't always get 5th normal form. In some cases I have to go lower, and when I do I have a specific reason for doing so, i.e. the data model doesn't lend itself to 5th normal form, or it's for maintenance purposes. All that really matters is that the data's integrity is upheld by the database.
In this day and age we have some pretty sweet computers, and believe you me, if you feel the need to denormalize a transactional database for "performance", you're doing something wrong and your name belongs in one of these here little stories. Of course, remember I said transactional... a data warehouse is denormalized and can still be well designed, but a DW serves a very different purpose than a transactional database.
Admin
Nah, we're getting taxed at the same speed, but when we need to request a copy of our old returns, it takes a few days instead of a few weeks. And our tax dollars are paying fewer employees at the IRS, since they don't need people warming the seats while the queries are made.
Oh wait, it's the IRS. Never mind.
Admin
You performed two operations here, only one of which was normalization.
Step 1. Normalization - replace primary keys in tables 2 and 3 with foreign keys to table 1.
Step 2. Optimization - introduce a surrogate key for table 1, and update incoming foreign keys.
In practice you combined these two steps into one operation, but it isn't fair to attribute the performance gains of introducing the surrogate key to normalization.
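A sketch of the two steps on a toy schema (all names invented, sqlite3 standing in for the real engine), so it's clear where the surrogate key enters:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Before: the referencing table repeats the long natural key from table 1.
    cur.executescript("""
        CREATE TABLE employer_old (name VARCHAR(200) PRIMARY KEY, city TEXT);
        CREATE TABLE w2_old       (form_id INTEGER PRIMARY KEY, employer_name VARCHAR(200));
        INSERT INTO employer_old VALUES ('Initech LLC', 'Austin');
        INSERT INTO w2_old       VALUES (1, 'Initech LLC');
    """)

    # Step 2 on its own: introduce a surrogate integer key and rewrite the incoming reference.
    cur.executescript("""
        CREATE TABLE employer (employer_id INTEGER PRIMARY KEY,
                               name VARCHAR(200) UNIQUE NOT NULL,
                               city TEXT);
        INSERT INTO employer (name, city) SELECT name, city FROM employer_old;

        CREATE TABLE w2 (form_id INTEGER PRIMARY KEY,
                         employer_id INTEGER NOT NULL REFERENCES employer(employer_id));
        INSERT INTO w2 (form_id, employer_id)
            SELECT o.form_id, e.employer_id
            FROM w2_old o JOIN employer e ON e.name = o.employer_name;
    """)
    print(cur.execute("SELECT * FROM w2").fetchall())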
Admin
Exactly. I have a MySQL database that I threw together quite quickly and without knowing too much about MySQL optimization (and the optimizations I did make were not for fast single-row lookups). It also takes me under a second to get a record from a 70+ million row table on the first search.
Admin
But that's what memcached and reverse proxies are for, stupid! Why should the Rails developers have to think when they can just add another server... Databases are just a necessary evil. Just use XML and be done with it (or YAML, of course).
Admin
One story at a time.
Admin
Ken, no-one really cares about your db. Let's not get into a "my database is faster than yours" argument.
Admin
Nice one. I wonder what the qualifications of the DBAs were? And I'm also wondering how long it took Bobby to index those tables!! It would take quite a while. Fabulous nonetheless... hope he got a pay hike and the DBAs' job.
Admin
Peter: Saying that "normalization is anti-performance" is an over-simplification--managing redundant data also kills performance, and normalization can reduce that. On the other hand, pursuing normalization for its own sake can certainly reduce performance eventually.
A general rule of thumb that I hear and see practiced in the industry is to design for 3rd Normal Form, then "turn the dial back a quarter turn" to roughly 2.75 Normal Form (meaning denormalize just a little in the handful of places where performance is measurably subpar).
Of course, in this case, we have to wonder whether there was any normalization or even design worthy of the title.
Admin
Exactly. Making a blanket assumption that normalized databases are anti-performance is a sign of noob-level understanding. It all comes down to where you want the performance: reading, writing, aggregating, etc. If you are looking for quick reporting, then by all means de-normalize the structure. But, otherwise, flat database structures become unwieldy and full of redundant data. This is why a lot of large organizations -- namely, banks, insurance agencies, government agencies, etc. -- maintain a normalized "storage DB" and a de-normalized reporting/read-only DB: usually, you want performance on both ends of the spectrum.
Admin
No, this is not normalisation in any way, shape or form.
No, in the non-degenerate case, a database will create indices with equal ease, be the key an integer or a VARCHAR(200).
I believe the term you are searching for is "caching." Since this applies equally as well to in-memory data structures, most mere programmers are just as comfortable with the concept as the mighty DBA.
Admin
This is the IRS: I'd guess that at least one person in that room might have had a gun. If I'd had one, I'd have said, "STOP! Touch one more key and I'll kill you."
But then, I've always wanted to say that to somebody. Perhaps that would have been an overreaction.
M
Admin
The first is truly an administrator. He knows how to bring the DB up and down, expand tablespaces, get the most recent backup tape off-site, etc. He's great for a stable (aka legacy) system.
But if you're making active changes, as in a system that's still under development, you want the 2nd type. Unfortunately, they're far rarer. If you find one, do your best to keep him.
Admin
F: Performance is not improved by simply adding indexes; it comes from adding the "right" indexes. That is supposed to be one of the hallmarks of a decent DBA. Adding indexes actually reduces performance on data writes (insert, update, delete) while (hopefully) improving performance on data reads (select).
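A crude way to see both sides of that trade-off (illustrative only; table names made up, and the actual numbers will vary wildly by engine and hardware):

    import sqlite3, time

    def timed(label, fn):
        t0 = time.perf_counter()
        result = fn()
        print(f"{label}: {time.perf_counter() - t0:.3f}s")
        return result

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, account INTEGER, amount REAL)")
    rows = [(i % 5000, float(i)) for i in range(200_000)]

    # Writes pay for index maintenance...
    timed("insert, no index", lambda: cur.executemany(
        "INSERT INTO ledger (account, amount) VALUES (?, ?)", rows))
    cur.execute("CREATE INDEX idx_ledger_account ON ledger(account)")
    timed("insert, with index", lambda: cur.executemany(
        "INSERT INTO ledger (account, amount) VALUES (?, ?)", rows))

    # ...but reads on the indexed column no longer scan the whole table.
    timed("indexed lookup", lambda: cur.execute(
        "SELECT SUM(amount) FROM ledger WHERE account = 42").fetchone())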
Admin
We may not be Facebook or Twitter, but not many people are, and I always get a good laugh out of developers who casually dismiss scale and performance issues in favour of the latest programming fad.
Oh, and to add to the current tangential debate - normalization can be a curse or a blessing depending on what it is that you're trying to do. There are some things we used to have normalized that have been denormalized due to the performance costs; however, I've seen plenty of other databases that made me wonder if the designers had even heard of 3NF.
Admin
Awesome comment, Hansel.