Admin
Just think if we had to add some... now everything might not even align correctly depending on the sequence of events.
Admin
I'm very surprised that, being so concerned about space optimization, the original database developer didn't at least use simple RLE compression on this column's data.
Admin
One database I worked on had to be unnormalized because the normalized form would have taken maybe 10 times the disk space of the unnormalized form.
RDBMSs are a tool, and sometimes they're the wrong tool. Fortunately, the major vendors are catching on; even Oracle now has repeating groups.
Look at the typical example: customers, orders, and payments. Say each customer could have thousands of orders and thousands of payments, but only one balance-due column per customer. Why would you keep totalling up those orders and payments instead of just maintaining a balance column per customer every time a new order or payment is added? Yet this is exactly the sort of thing I hear database purists rail against.
Admin
Segal's Law
Admin
LOOOOL - been a while since I laughed this well.
Admin
Ummm,wtf... unormalized = duplication, normalized = no duplication... so how the **** would a normalized database take up 10x the ammount of an unormalized.
Sum(orderAmmount)
It's not hard... so why bother maintaining a separate record? Unless we're talking about many disparate systems, such as in a bank, where a dozen or so systems need to run batch jobs to aggregate information from various data sources to calculate the end-of-day balance.
Admin
Oh, hush. The "SQL by string concatenation" anti-pattern can be implemented in ANY language, and PHP has had DB packages that support parameter substitution for years now.
Admin
That can't be right. Normalization, by definition, REDUCES the amount of storage a database requires, since the amount of redundant information is minimized; any atomic piece of data is stored exactly once, and relational references to that data almost always take up less storage than copies of the data (consider a varchar(2000) vs. an integer).
Denormalization is useful in the opposite case of what you described -- when you need to simplify complex relational queries in order to gain raw performance, and you have extra disk space to spare.
Admin
I dunno... if the assumption does hold that there would never be more than 2000 product lines, and you only ever need to get the status of particular items (as opposed to items matching...), that's really not the worst way it could have been done.
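To be fair, a per-item lookup against that layout really is a one-liner. A minimal sketch, assuming the agency/strProductLines/intAgentId names that appear in the view posted later in this thread (T-SQL syntax; the two @ parameters would be supplied by the application):

    -- Look up the flag character for one product line by its position in the string.
    SELECT SUBSTRING(strProductLines, @intProductLineId, 1) AS chrStatus
    FROM   agency
    WHERE  intAgentId = @intAgentId;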
Admin
Heh, now there's an idea for a web site...sort of like the opposite of LinkedIn.com.
The only thing is, I think I'd get too depressed by the number of idiots within 3 degrees of me.
Admin
That's so elitist - assuming that people who write in a language don't know about the dangers of SQL injection and how to get around them.
(Hint: in that case, if it's a known integer -- i.e. after settype($prodID, "int"), (int)$prodID, or $prodID += 0, it's an int -- you won't have to worry about injection there.)
Admin
The total is always equal to the sum of the individual records. If you were unable to update the total despite updating one of the individual records being summarized by it, then your update failed and you should roll back. That's what transactions are for.
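A minimal sketch of that idea, assuming hypothetical orders/customer tables with a balance_due summary column:

    BEGIN TRANSACTION;

    -- Add the detail record...
    INSERT INTO orders (customer_id, order_amount)
    VALUES (42, 19.95);

    -- ...and update the summary in the same transaction.
    UPDATE customer
    SET    balance_due = balance_due + 19.95
    WHERE  customer_id = 42;

    COMMIT TRANSACTION;
    -- If either statement fails, issue ROLLBACK instead and neither change sticks.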
Additionally, I could have a watch that was an hour off. Even if it was my only watch, I would be pretty foolish to ignore all the other watches that were actually correct. So obviously Segal was one silly fellow; he should leave lawmaking to those a bit more in touch with reality!
Admin
Not a problem if the database does the work of updating the column for you... although there are still many reasons not to do this (it makes the parent row a concurrency/locking hot spot, for one).
Admin
Has this technique been patented? If not, I would like to use it in my next DB design.
Also I'm trying to use a FOR NEXT loop in the WHERE clause of a SQL statement but I get a syntax error. Our DBA is a real jerk and won't help, does anyone have any code they can send me?
Admin
Here's a simplification of the database; it was an attendance-tracking system. For each instance of "attendance", I need the student id, the course id, the date of the meeting, and the attendance value. The student id in this case is 9 digits, the date is 8 digits, the class is 8 digits, and the attendance value could be as small as ONE BIT (but I went ahead and used one digit).
While you can optimize somewhat by reducing the size of the combined key, you can instead track attendance as a repeating group for a big win on storage space once your class size exceeds just a few students or meetings. (If you do allow repeating groups, then the entire string of attendance values could be just one variable (an array), and in that case it would be easily normalized. However, repeating groups (arrays) in a record throw some RDBMS weenies into fits!)
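Roughly, a repeating-group layout like that might look like this in an engine that supports array columns. This is only a sketch in PostgreSQL syntax with made-up table and column names, not the schema actually described above:

    -- Attendance stored as a repeating group (array) per enrolment.
    CREATE TABLE course_attendance (
        student_id  CHAR(9)    NOT NULL,
        course_id   CHAR(8)    NOT NULL,
        attendance  SMALLINT[] NOT NULL,  -- one entry per class meeting
        PRIMARY KEY (student_id, course_id)
    );

    -- Append one more meeting's value for one student:
    UPDATE course_attendance
    SET    attendance = array_append(attendance, CAST(1 AS SMALLINT))
    WHERE  student_id = '123456789' AND course_id = 'CS101-FA';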
Admin
Thanks. Now everyone within the sound of my laughter knows that I am not working.
:)
Admin
The proper way to perform non-trivial refactoring like this is as follows:
Convert programs, one by one, from accessing the stupid format to accessing the sane format. Take as long as you need.
Convert the sane format to a real table and ditch the stupid format:
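A minimal sketch of that second step, assuming the agency/strProductLines/productLines names from the view posted elsewhere in this thread and a hypothetical agent_product_line target table (T-SQL syntax):

    -- The "sane format": one row per agent per product line.
    CREATE TABLE agent_product_line (
        intAgentId       INT      NOT NULL,
        intProductLineId INT      NOT NULL,
        chrStatus        NCHAR(1) NOT NULL,
        PRIMARY KEY (intAgentId, intProductLineId)
    );

    -- Explode each character position of the flag string into its own row.
    INSERT INTO agent_product_line (intAgentId, intProductLineId, chrStatus)
    SELECT a.intAgentId,
           pl.intProductLineId,
           SUBSTRING(a.strProductLines, pl.intProductLineId, 1)
    FROM   agency a
    JOIN   productLines pl
      ON   pl.intProductLineId <= LEN(a.strProductLines);

    -- Once every program reads the new table, ditch the stupid format.
    ALTER TABLE agency DROP COLUMN strProductLines;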
Admin
Because row and column storage (and retrieval) costs are not the same.
The level of ignorance displayed in the main article is impressive, but the ignorance displayed here by DBAs concerning the implementation details of their own databases is downright astonishing.
As a rough guide, the marginal cost of adding a redundant integer column to a row in a table you already have is between 10 and 100 times less than the cost of adding a non-redundant column of the same size to a row in a table you don't have yet. This factor depends on whether you need to have indexes and how much overhead your RDBMS tacks onto each table row. In some cases other details like whether your rows are fixed width or not come into play. (*)
VARCHAR(2000) columns may take anywhere from 2 to 8000 bytes of amortized average storage space, depending on the actual length of the string stored and details of the implementation. Some RDBMS engines actually do common-prefix optimizations, LZO, or similar compression on large string and array objects in order to trade I/O time against CPU time, while other engines translate ASCII strings into fixed-length uncompressed UTF-16, add a few bytes of overhead, and then round up to the nearest page size. Some do things that are even less efficient.
Adding 2000 child rows containing two integer columns and an enum type with indexes to support joins will occupy megabytes of space per row in the parent table, and lose advantages that could be gained from having locality of reference or transparent data compression in the RDBMS. If the application requirements only include this information for application-side decision making (e.g. for enabling UI features as opposed to generating reports) then the huge costs of normalizing this particular data structure result in no actual benefit.
There isn't a single rule that can be applied here. "Pure" normalization is nice, but like anything it has appropriate and inappropriate uses. It could turn out that the original developer did some prototypes and discovered that the best approach (at least at the time) is the one that is in place now. On the other hand, maybe there are only 100 rows in the parent table, so the cost of 100 times more storage is acceptable. "Huge storage costs" might mean an actual increase in database size of less than 1GB. We simply don't have enough information to judge.
(*) Presumably the RDBMS implementation could (in theory) translate the "normalized" DDL into an internal implementation that looks like this CodeSOD's data structure. The closest I've seen to that capability in a general-purpose RDBMS in the field only supports read-only tables, though, and everything people have posted here so far about slow queries applies to this approach.
Admin
Wow. You're totally right, but did you dip your pants in hydrazine this morning?
Admin
Please post the DB design so we can tear it apart.
Repeating groups with native support from the language might not be a WTF, e.g. "select * from the_table where list_of_foo contains element"
As for a balance column in the customer table, provided that it's maintained automatically via triggers, that's not a WTF either. However, for other reasons (e.g. printing monthly statements), you should think about giving the orders/invoices/payments/etc. tables an Open/Closed status flag with an index on it.
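A minimal sketch of that kind of trigger, assuming hypothetical customer/orders tables and T-SQL syntax (a real version would also need DELETE/UPDATE triggers):

    CREATE TRIGGER trg_orders_balance
    ON orders
    AFTER INSERT
    AS
    BEGIN
        -- Keep the denormalized balance in step with the newly inserted detail rows.
        UPDATE c
        SET    c.balance_due = c.balance_due + i.total_amount
        FROM   customer c
        JOIN   (SELECT customer_id, SUM(order_amount) AS total_amount
                FROM   inserted
                GROUP BY customer_id) i
          ON   i.customer_id = c.customer_id;
    END;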
Admin
mysql_query allows exactly one query per call, so unless you're executing an entire client-side generated sql string, the worst you'll get is an error.
If the original example in your quote had evaluated to a string containing a second, injected statement, PHP would return an error.
I'm ok with bashing languages and all, but at least learn how the standard tools in the language work before complaining about them.
Admin
So, you're saying you don't approve?
Admin
Wow, someone got up on the wrong side of the keyboard this morning. His solution works - it's just prohibitively expensive. This, coincidentally, is how any good solution starts. Then it is revised until it is acceptable, right?
Admin
How anonymized was this example? I was given a site to maintain that had products/product lines available to agents, with a database "designed" by someone who created columns with names like AgentThreeAddressFive, and his initials were the same as a common abbreviation for a bodily function...
Admin
Wow... looks like someone needs a college education; obviously high school wasn't enough.
Admin
Actually, I'd rather work with Emil than with you. Emil proposed a solution to the problem that might, after fixing minor syntax bugs, actually work. It's not a very beautiful solution, but in the given setup, nothing is beautiful.
As Emil suggested, it might even be faster than the normalized database schema every normal person (including you, Emil, and me) would choose. Of course the query requires a full table scan, but since there is (most likely) at most one row per agent, and the number of agents is most likely in the range of a few hundred to a few thousand, the database will easily handle that.
So there is Emil, who proposes a solution, and KenW, a complacent blowhard who just berates other people. And to add insult to injury, said KenW's post gets featured.
Admin
Nothing can school developers about maintainability - and finding the all-important balance between "getting it done" and "doing it right" - quite so effectively as a few solid years spent maintaining and adding new features to their own houses of cards. A few gray hairs later, and you've got a perfectly serviceable intermediate programmer.
Admin
What is a possible solution in a case like this? A cronjob that creates a REAL relation once a day? I think I'd have to rip that apart and do it over.
Sometimes you don't want to mess with code that "works" but other times just rewriting will save you so much pain later.
Admin
I totally disagree with you. If you totalize everything yourself and maintain the number in the DB, it defeats the purpose. A query can easily return the correct data. If you do it through code, it's easy to get everything out of sync: one failed write and you would have to actually reread the items and total them again to get the correct information. Databases are meant to store data and provide a mechanism to use that data in any way you see fit (queries, reports) for information and analysis purposes. Maybe back in high school an unnormalized database would have fit the bill, but for a scalable solution, any good DB developer would agree normalization is imperative.
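For example, a minimal sketch of the query-side approach, assuming a hypothetical orders table:

    -- Derive each customer's balance from the detail rows on demand,
    -- instead of maintaining a stored total.
    SELECT customer_id,
           SUM(order_amount) AS balance_due
    FROM   orders
    GROUP BY customer_id;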
Admin
What's wrong with refactoring the code? This is terrible and unmaintainable, so get rid of it. Besides, someone's already proposed a migration strategy off of this POS.
Admin
The true brillance (tm) of the design is the fact that it uses Nvarchar, rather than just regular varchar. Sure, he may be "wasting" an extra byte for every row and product line, but with over 65,000 characters to choose from this thing must scale to unimaginable (or is that "unmanageable"?) levels of complexity! The original designer really thought through that whole "takes up too much space" problem.
Admin
You cannot fail the write with an RDBMS that supports transactions. If you could, then a simple accounting app could debit one account but fail to credit another account with the balancing entry, leaving the entire chart of accounts out of balance. Yet according to you, that is somehow more acceptable than updating a summary record in the same transaction that adds a detail record being totalled in it?
Ok then maybe you mean that I forgot to include the update in my code, but even then, isn't that just like forgetting the balancing entry in my accounting application? Heck, I'm in even better shape than your accounting app, because I can simply put my code in a trigger and forget about it, whereas you have to actually SPECIFY TWO SEPARATE ACCOUNTS to update ALL THE TIME.
This is like an application changing my street address but somehow "failing" to update my city, state, and zip. Either your RDBMS can do atomic updates to a practical level, or it can't. Make up your mind, THEN come back and school me if you want.
Admin
Sounds like someone forgot their Midol today...
Admin
I don't think you can create indexes like that (and even if you could, I wouldn't create 200 indexes on it). Much better to add a product line table and a join table between sales critters and product lines:
- SALES_GUY: existing table - nuke the varchars after we migrate the code to not use them
- PRODUCT_LINE: stuff describing a product line
- SALES_GUY_PROD_PERMISSION_DESC: 10 rows describing what each state means
- SALES_GUY_PROD_PERMISSIONS: FK to each of SALES_GUY and PRODUCT_LINE, a column for the permission (FK to SALES_GUY_PROD_PERMISSION_DESC)
add an index on the join table and now everything is fast and sane.
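A minimal sketch of that layout (column names are assumptions, and SALES_GUY is assumed to already have a sales_guy_id key):

    CREATE TABLE PRODUCT_LINE (
        product_line_id INT          PRIMARY KEY,
        line_name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE SALES_GUY_PROD_PERMISSION_DESC (
        permission_code CHAR(1)      PRIMARY KEY,
        description     VARCHAR(200) NOT NULL   -- the ~10 states
    );

    CREATE TABLE SALES_GUY_PROD_PERMISSIONS (
        sales_guy_id    INT     NOT NULL REFERENCES SALES_GUY (sales_guy_id),
        product_line_id INT     NOT NULL REFERENCES PRODUCT_LINE (product_line_id),
        permission_code CHAR(1) NOT NULL
            REFERENCES SALES_GUY_PROD_PERMISSION_DESC (permission_code),
        PRIMARY KEY (sales_guy_id, product_line_id)
    );

    -- The index that makes "who can sell product line X?" fast.
    CREATE INDEX ix_sgpp_line_permission
        ON SALES_GUY_PROD_PERMISSIONS (product_line_id, permission_code);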
Admin
Here's my crummy view that pivots things like this:
create view viewAgentProductLines as
select a.intAgentId,
       substr(a.strProductLines, pl.intProductLineId, 1) as strAgentProductFlag,
       pflag.description
from   agency as a,
       productLines as pl,
       productLineFlags as pflag
where  substr(a.strProductLines, pl.intProductLineId, 1) = pflag.allowance_code;
Please Note:
Admin
I'm so glad someone said this. I don't work as a programmer by trade, but KenW and all the others screaming about this strike me as incredibly lucky, incredibly naive, or they've just never held down a real job.
I'm betting that emil knows the definition of "deadline" and knows to get a working solution first, THEN ask to have a meeting about refactoring, involving the people affected by such a refactoring...if you dive in like a maverick and start refactoring without a care in the world about the repercussions, well, I hear McDonald's is hiring...
Admin
If they stored the SQL strings required to access all the information in the agents database in another table called "agent_lookup_SQL", then it would be easy to access the info!
Admin
Well, I would use something like this:
INSERT INTO old_dev SET rectum = largebluntOBJ WHERE largebluntOBJ = '1964 Cadillac Hearse';
Yes, I know it's not exactly a valid SQL statement, but it makes me feel better.
Admin
Consider this: once upon a time I saw a dev team deliver an app to a client. This app was quite pretty, and worked very well, for a few days. Client management was so impressed with this app, they decided to let it handle all their archival data, too, and dumped it all into the database, all 12 gigs worth.
Now, a properly designed database would take that 12 gigs and say "Yeah, what else ya got? Bring it on!" But this database was designed by a naive development team who figured that things like normalization were just something some fat grey-beard pulled out of his ass one day when he was bored. They were also absolutely certain that this app would never have to handle more than a few thousand records.
I found out about all this when the client called me up dismayed. When the app was doing some task and it was going to take more than a few seconds, it would helpfully display an estimated time 'till completion. For a couple of common tasks, the time 'till completion displayed was ... I shit you not ... several days.
Sometimes things change, businesses change, go in different directions. Robust solutions just keep on working, brittle ones break.
Admin
Please, for the sake of all that's good in the world, find yourself a line of work far removed from other people.
Admin
So I give up... do you pronounce "unormalized" as "Un" + "ormalized" or "U" + "normalized"?
At my last job, I was asked to help the guy in charge of the Payroll project. When I saw columns like the one in the article in his database, I nicely explained to him that he could handle the data retrieval and I would be glad to help with the UI and business tiers.
Admin
This, however, is a ludicrous piece of self-justification.
To be interested in the amount of disk space required, one of two things should apply:
(a) You can't afford an 80G disk, or (b) your application gets so many hits per second that locality problems ensue.
I doubt either applies, in your case. Or in any comparable case. But let's just give you the benefit of the (huge) doubt.
Your date fits into a BIGINT (strangely enough, it even fits into a DATE); your class fits into a BIGINT; even your student id (and I applaud any university or college that plans to have more than 4,000,000,000 students attend somewhere between foundation and having the plug pulled because of a moronic computer-based accounting system) can be fitted into a BIGINT using fairly simple compression techniques. Thanks for losing the seven bits, though -- that'll buy you two coffees and a double espresso latte.
Have you considered the other benefits of using an RDBMS (properly, that is)? The reason that repeating groups in a record throw some RDBMS weenies (also non-weenies, also ordinary programmers like me) into fits is that they're just plain cretinous: no gain, much pain. I'd advise a simple ISAM database, but even that fails at the same hurdle.
Are you insane?
Admin
YAGNI!
Admin
So you get a kick out of insulting other people, huh?
Admin
Heh. Show me the tiny, trivial databases you work on. This is nothing, and it reflects that the guy who built it didn't want to use or depend on the SQL database.
It's trivial to take that field and create a display of who gets what on what lines programmatically. I'll bet that when he wrote it, it got the job done. New guys always thumb their noses at the guys who came before them.
Rule #1 for assessing programmers: The ones who are condescending suck at getting things done. I usually get their jobs (or at least their projects)...
Databases are quite cool, but naive programmers often unload their work on the DB and stuff it full of records - which bog down the system. Normalization is not always the solution - but you'll learn.
Anyone want to speed test the string thing? And why the hell would you index a column containing 1 character, especially when it only has 11 or 12 possible values? That's just nuts!