Databases are wonderful things. They help you organize your data at multiple levels. Large logical units of data are stored in databases. Smaller logical divisions are stored in tables, and individual records are stored in rows. Once you have your data stored like that, you can access it in all sorts of wonderful ways.


Of course, there's more to it than that. Beneath the database is the file system. Beneath that, there are one or more drive arrays, each with one or more disk drives, each with one or more platters. Read/write heads move back and forth to access the data on those platters, and all sorts of sophisticated mechanisms have been developed to minimize the dead time spent with a head in the right place, waiting for the platter to spin the data beneath it.

DBAs take this into account when deciding where a database and its tables will be stored. For example, to minimize contention they avoid putting two databases (for different applications) that both require extremely high rates of access on the same disks, because a head can't be in two places at once. Sometimes they'll stripe the data for a single table across multiple spindles, hashing rows across partitions to smooth out hot spots.
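Striping by hash is simple enough in principle. Here's a minimal sketch of the idea (illustrative only, not this shop's actual setup; the partition count and keys are assumptions) that maps row keys onto a handful of partitions:

```python
# Minimal sketch of hash partitioning: rows are spread across partitions
# (say, files on different spindles) by hashing the row key.
# NUM_PARTITIONS and the keys below are illustrative assumptions.
import zlib

NUM_PARTITIONS = 4  # e.g. one partition per spindle

def partition_for(row_key: str) -> int:
    """Deterministically map a row key to a partition."""
    return zlib.crc32(row_key.encode()) % NUM_PARTITIONS

# Different keys land on different partitions, spreading out hot spots.
for key in ("order-1001", "order-1002", "order-1003"):
    print(key, "-> partition", partition_for(key))
```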

Mr. Anon. works with a DBA who believes there is a better approach: each database should contain exactly one table, and each database should reside in its own data file on the disk. Accordingly, there are thousands of data files. Since the operating system limits the number of open file handles, the database server can only have so many databases open at one time. The underlying theory was that this would reduce contention.

However, since the files are scattered all over the disks, the heads-can't-be-in-two-places-at-once problem is still very much there. Because each table/database lives in a single file, it can no longer be striped (as one might with partitions). And because the database server can only hold so many file handles open at once, it has to juggle them, closing one database's file in order to open another's. As one might expect, that overhead is quite noticeable.
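To make the juggling concrete, here's a hedged sketch of what a server effectively has to do once it has more database files than permitted open handles: keep an LRU cache of handles and churn through open/close calls. The limit and the API here are assumptions for illustration, not the vendor's actual code.

```python
# Sketch of the handle-juggling overhead: a tiny LRU cache of open file handles.
# MAX_OPEN_HANDLES is a stand-in for the OS/process limit.
from collections import OrderedDict

MAX_OPEN_HANDLES = 256

class HandleCache:
    def __init__(self, limit=MAX_OPEN_HANDLES):
        self.limit = limit
        self.handles = OrderedDict()  # path -> open file object

    def get(self, path):
        if path in self.handles:
            self.handles.move_to_end(path)        # already open: mark as recently used
            return self.handles[path]
        if len(self.handles) >= self.limit:
            _, oldest = self.handles.popitem(last=False)
            oldest.close()                         # evict the least-recently-used handle
        f = open(path, "rb")                       # this open/close churn is the overhead
        self.handles[path] = f
        return f
```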

Of course, there is the maintenance side of things as well. Changing the log segment of a 200-table database is basically a one-off action. Doing the same thing for a schema split into 200 single-table databases becomes a significant amount of effort: the work is O(n) in the number of databases, so one statement turns into two hundred. The same goes for every other database-level attribute.
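In rough terms, what used to be a single administrative statement turns into a loop over every database. A hypothetical sketch (the statement syntax and the run_sql helper are made up for illustration, not a real DBMS dialect):

```python
# Hypothetical sketch of the O(n) maintenance cost: the same database-level
# change repeated once per single-table database. Names, syntax, and run_sql
# are assumptions, not from the article.

def run_sql(statement: str) -> None:
    print("executing:", statement)  # stand-in for a real database client call

# One 200-table database: a single change.
run_sql("ALTER DATABASE app_db SET LOG SEGMENT SIZE 512MB")

# 200 one-table databases: the same change, 200 times.
for i in range(1, 201):
    run_sql(f"ALTER DATABASE app_db_table_{i:03d} SET LOG SEGMENT SIZE 512MB")
```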

A more interesting facet is that a single table of your database can now go down while all the others happily keep processing requests, which means multi-table inserts have to be coded as multi-phase commits.
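For the curious, here's a toy sketch of what that pushes onto application code: a bare-bones two-phase commit across databases that can fail independently. The Connection class and its prepare/commit/rollback methods are assumptions for illustration, not any particular driver's API, and coordinator failure is ignored entirely.

```python
# Toy two-phase commit across several one-table databases.
# Connection and its methods are illustrative stubs, not a real driver.

class Connection:
    def __init__(self, name): self.name = name
    def prepare(self): print(f"{self.name}: prepared")    # phase 1: vote to commit
    def commit(self): print(f"{self.name}: committed")     # phase 2: commit
    def rollback(self): print(f"{self.name}: rolled back")

def multi_table_insert(connections):
    prepared = []
    try:
        for conn in connections:
            conn.prepare()          # phase 1: every database must vote yes
            prepared.append(conn)
    except Exception:
        for conn in prepared:
            conn.rollback()         # any failure rolls back everyone who prepared
        raise
    for conn in connections:
        conn.commit()               # phase 2: commit everywhere

multi_table_insert([Connection("orders_db"), Connection("order_items_db")])
```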

Equally interesting is how one might code a join of n tables when any number of them could live in databases that are currently unavailable. Hmm?

All of this was rationalized as acceptable because it was all in support of a web application, where performance didn't matter.

Ah well, at least this way you know you'll never find a database with thousands of tables in it.
