Percy's employer is an "enterprise vendor". They have a variety of products all within the "enterprise" space. Like most enterprise products, they're sold on the strength of the marketing team's ability to claim with a straight face that they're secure, robust, reliable, and performant.
While the company offered a "cloud" solution for their enterprise suite, the real money was in the on premises version. With on-prem, any updates or upgrades were almost guaranteed to break the entire environment unless the customer shoveled huge piles of cash at the company to get a specialist to come out and do the upgrades.
That was good for the company, but wasn't so great for the specialists. Simple patches could take weeks to perform, not because of any trickiness or difficulty, but because each upgrade step could take days to run.
Percy found out about this when sitting in on a project review meeting.
"For our instant messaging engine," one of his peers, Elizabeth, explained, "just handling the room policy records can take upwards of ten hours at some sites. That's one SQL statement. God forbid the transaction fails for some reason."
"How could it be that bad? How many records is it?" Percy asked.
"Only about 100,000. The problem is the way the data is structured." Elizabeth pulled up the schema.
As an enterprise product, the messaging system prizes configurability and customizability over utility, performance, and common sense. It's not designed to solve a problem, but instead to allow customers to solve problems they didn't know they even had, mostly by inventing them. As a messaging/chat system, they had a concept of "rooms", and the "rooms" could have every aspect about them customized. From how usernames display in the chat, to whether files can be sent, to even whether or not copy-paste is permitted. Of course, the same interface which configured those cosmetic flags also configured role-based permissions, transport layer settings, and whether the messages were compressed before sending to the server.
And with each version of the product, the specific set of fields could change.
Those of you who work in RDBMSes are likely imagining a vast field of relations bound together by foreign keys and indexed to optimize likely query paths. Those of you who are more on the NoSQL side of things are likely imagining a sparse JSON/BSON document, possibly a heavily denormalized representation of the dataset.
Those of you who know enterprise products, or who have read this site before, know what it actually was: gigantic XML documents stored as CLOBs in the database. To add insult to injury, they used SQL Server on the backend, which has had a native XML type with optimized query operations since the mid-2000s.
Percy pulled up the schema on his own machine to explore it. At the center of the schema was a ChatRoomPolicy
table. This table had three fields- a PolicyId
and the RoomPolicy
, which was the main gigantic XML document. In addition to that XML pile, you also had a Visibility
column, which contained an XML document which contained ChatRoomVisibility
elements- essentially little things, like whether attempting to send a message included a "Cancel" button, or if users could see a "Mute" button to silence the notifications.
Then, of course, you had the ChatRoom
table, which had a foreign key reference to the policy which controlled the chat room.
"So, yeah, when we do an upgrade which alters the schema of the RoomPolicy
documents," Elizabeth explained, "we have to read every row, parse the XML in our upgrade script, generate the new XML, and then store it. That's if the XML changes- sometimes an upgrade only renames one key, and not every document contains that key."
Percy had seen some pretty horrible things in his section of the product, but nothing quite like this. "Well, at least you can just map the XML document to a class, and mostly automate the changes, right?"
Elizabeth laughed at him. "If we did that, then our customers couldn't make up their own XML keys, and then add their own procedures in our proprietary scripting language which reads them, and then I wouldn't constantly get tickets because some PM at a three-branch bank in East Pilsbury decided they wanted a new field in their chatroom and it happens to collide with one of the fields we already use."
More specifically, the XML didn't have an associated schema. The upgrade script had to be able to handle any arbitrary XML document, identify the parts that the script cared about, if they existed, modify them, if necessary, hopefully without breaking anything the customer had customized, and then actually do the update.
The resulting upgrades could, on a good day, get about 3 records processed every second. With some organizations, the pile of policies applied to chat rooms could extend into the hundreds of thousands (as "private chats" also counted as chat rooms and had policy documents attached). Thus, ten hours just to upgrade a single module of a single service in their enterprise software portfolio. And that's just the upgrade task.
"Wow, I can't imagine how long testing must take," Percy said.
Elizabeth laughed, again. "That's up to the client to schedule. They usually don't."