Roll Your Own Clustered Index

It was a slow day for Levi C. His company had an excellent staff and meticulous development and testing procedures, so calls for maintenance were rare. He was beginning to feel like the Maytag repairman of maintenance programmers. Fortunately, he was delighted to get a call from one of his favorite clients.

"Hi Levi, if you have any spare time, could you check out this report that's running slow? It's for a system we had developed by another company. It's really no big deal, just if you have a second..." Bored and eager to help, he got all the information he could about the issue. He tried running the report for the previous month, and it was indeed running slow. He ran the report again for the past two months, and it timed out.

Because this system was new to him, he called up one of the old developers. "Oh, yeah, one or our developers forgot to initialize the database a few months back." Forgot to initialize the database? Levi thought.

Levi checked the query used to run the report. It had several thousand UNIONs that combined CustomerHistory_2007_02_75, CustomerHistory_2007_02_74, CustomerHistory_2007_02_73, and so on. Talking to one of the developers, he found that the tables were named to include a year, month, and customer ID. Forgetting to "initialize the database" meant failing to create a CustomerHistory table for each customer ID that month, or forgetting to update reports that queried the thousands of tables in the database. No, creating the tables wasn't automated - it had to be done manually.

The developers absolutely had to remember to create a new table for each customer at the beginning of each month. If they forgot to add the tables, the whole system would fail over. What's worse, their database server was configured to cache table definition data for only the last 64 accessed tables - a far cry from the thousands of tables commonly used in the system. Each table had the same schema as all of the others, including one autoincrementing ID field, and each table started the index from 1. If you wanted to retrieve a particular row of customer data, you'd have to know the year, month, customer ID, and row ID. If a column was to be added, thousands of tables would need to be changed.

Why was this implementation chosen? The developers either didn't know about or didn't trust in MySQL's ability to cluster data. By creating thousands of tables, they'd created their own method of clustering data. As for their plan to optimize the database, you know what they say - "aim for the stars, hope not to get the exact polar opposite of what you were actually trying to accomplish."

[Advertisement] BuildMaster allows you to create a self-service release management platform that allows different teams to manage their applications. Explore how!