The Daily WTF: Curious Perversions in Information Technology

2022-04-20 Reply Admin

Prmtr optzn != ^

Steve_The_Cynic · 2022-04-20 Reply Admin

While we all know that premature optimisation is bad, it is rarely acknowledged that premature pessimisation is even worse.

nerd4sale · 2022-04-20 Reply Admin

Ah yes, the old storing relational data in a non-relational way in a relational database, for Reasons.

Besides, one record per device every 5 minutes, is only 288 records per device per day. Unless we're talking billions of devices, I fail to see the problem.

2022-04-20 Reply Admin

Currently cow-orking with a colleague who is dumping a line of debug info, once a day at around midnight, into a whole bunch of text files which are periodically harvested by a program I am developing.

Trouble is, we can't just grab copies of these files programatically according to our monitoring schedule, because he is holding them open. We have to jump through the irksome loop of taking a backup of each one, using a manual process that requires extra privileges.

I made the suggestion that he just open the file for append, write his line of data, then closed it again. He disagreed, on the grounds that the overhead of opening and closing a file would be an optimisation hit. He seems deaf to arguments that a) if he's got the file constantly open for write, the apps that need to access that file on a regular basis can't, and that b) holding files open 24/7 which you aren't doing anything with hogs resources.

Bah. Youngsters. Geroff me lawn.

2022-04-20 Reply Admin

If you are using a database as source control, it sounds more like you lost all control.

2022-04-20 Reply Admin

Unless we're talking billions of devices

Maybe this is the database that tells us how many devices run Java.

2022-04-20 Reply Admin

I got very similar experience in one of my previos contracts. Just the reason was different - Int overflow. You see, if you log location data for a lot of devices with high precision, and you have primary key INT32 for the position log, you got a problem. I tried propose either changing it to INT64, or using DeviceID + Timestamp instead - none of the proposals was accepted, but they created very same solution mentioned in the WTF (Only they thought about optimization, so they used binary with fixed length time/lat/lon records). It took half a year for two people to develop and test it all around. Reason why they choose this approach - making PK of INT64 (Or DeviceID + Timestamp) would consume too much storage ...

2022-04-20 Reply Admin

Missing requirement: the resolution of position fix required! They've already scrapped altitude, which is probably a fair compromise. I'd think that using some coordinate system (UTM/OLC?) other than lat/lon would be quite useful. I'd likely occasionally have a coarse position update and subsequent fine position updates, including some very compact way of saying "it hasn't moved," without worrying that degrees represent different absolute sizes depending where you are on the globe. Presumably they'd also want to keep track of "no position fix available" and "data deleted," which are also very compact.

105,120 records per device per year - not that bad. If you have 1M devices (I mean, we are talking about the famous Initech, right??), you're looking at 105 billion records per year. Good news is that you can easily store in a sorted order per device, so I would probably just partition by device, which also makes it really fast/easy to: load balance by hash across a cluster O(1), recall records by device O(logn), delete records by device O(n). End-of-day/end-of-month compression is probably also easiest by device.

It's so little data per device that you wouldn't have to partition again. If I had to partition again, I'd probably do it by month, because "how many 5-minutes since midnight," takes 9 bits (if only they had said every 6 minutes!), so I could prepend "day of month" (5 bits) without exceeding 16 bits.

Mr. TA · 2022-04-20 Reply Admin

Few problems here. One, their choice of storing it as text, and not binary. Two, placing this "condensed" storage in the same database; if I'm charitable, this belongs in some archive or reporting database, not the transactional "live" database. Three, I'm sure Oracle supports table partitioning (I don't have experience with it, but I do with MSSQL) and it's extremely effective at solving these types of challenges. Four, it's a reminder that choosing between single-tenant and multi-tenant database models is VERY important; I'm certain this shop should use single-tenant model, which would greatly reduce the load on the database - you're essentially partitioning by client. Five, to the extent that older records can be archived, this is a prime example for file-based (or cloud blob) storage as some CSV format; it's still readily accessible, very easy to UNION with live data, the performance is similar, and database costs are greatly reduced.

Addendum 2022-04-20 10:00: PS. To those who downplay the # of rows generated by logging systems like this - I currently work for a company which has a similar situation and it quickly gets out of hand with unmaintainable databases, hitting different types of limits, etc. So yes, it's a real problem; but no, they didn't solve it correctly.

2022-04-20 Reply Admin

That there is the thing here.

Apart from the lunacy of having a temporary table hanging around, just to condense everything into a CLOB (if I have the original insane design right), I can't really see a problem with updating the location info for a single id every five minutes. Yes, it should be done in a transaction. (Big deal!) But no, it's not like updating a table with counts of a resource, such as (say) an inventory system. Presumably there will never be a case where you are flooded with updates from the same id, so the supposed "expense" of a transaction boils down to the "expense" of using a relational database -- ie, the transaction itself.

The sad thing is, you could actually model the basic requirements off on the side with a small table, an injection system, and a stopwatch. If you did that, I'm pretty sure you could get down to updating every id in the set of ids every tenth of a second or so, and nobody would notice the supposed "load."

Funny how DB managers (I'm talking CTO types here) never ask for that very simple input.

2022-04-20 Reply Admin

Yup, I'd do with with an archiving system. Depends on requirements, of course.

However, that's basically how global credit card acquisition and reconciliation works. 99% of the time, you're only interested in the last day of two of transactions. For everything else, it's archived in (your example: csv) and pulled back for a JOIN.

Not only is this not rocket science: it practically predates rocket science.

2022-04-20 Reply Admin

@Mr. TA: OP here. You are right that even a billion rows can get out of hand in a table where the initial design didn't take scaling into account. I believe that that sharding was considered, but the oracle licensing costs would have been too expensive. I do agree that having CSV for "archive" data would be a better design, but I think that they wanted to keep the ability to run a single stored procedure to generate reports, etc.

@Sole Purpose of Visit: Couldn't just have a single row that gets updated, it was quite important to save the full history (and have the past few days quickly accessible).

2022-04-20 Reply Admin

Sounds like ELD devices. I suppose it's a big difference whether it's a regional carrier with maybe a couple dozen trucks or an absolute behemoth.

2022-04-20 Reply Admin

Oh man... I worked on a project that used a SQL database as a key/value store. Basically each key was a GUID and the value was a binary blob, that when deserialized could yield additional GUIDs to search for. It was all in a single table. Each row could represent one of many possible entities. There were 100s of thousands of rows in the table.

When I asked "why? and what's the schema definition?" the sage answer I got was that:

It gets less experienced developers used to using a SQL database, and
A schema is inflexible and makes things hard to change later on. Just 'version' the serialization/deserialization code and it will magically work out.

Suffice it to say, the project had a less than impressive outcome

2022-04-20 Reply Admin

One thing I haven't seen mentioned and I would assume is why they didn't use partitioning. Oracle. In the past (IDK about now probably still the same) you had to buy an Enterprise license to get features like partitioning, I would assume they wanted to avoid that license fee, hence they didn't have access to partitioning. Basically it's all Oracles fault as usual.

TheCPUWizard · 2022-04-21 Reply Admin

"If you are using a database as source control, it sounds more like you lost all control." - Name any source control that is NOT some type of database <evil grin>

2022-04-21 Reply Admin

Fair enough.

But that's where you don't do everything in a stored procedure. Stored procedures (particularly in Oracle or MS) are a bit of a bodge.

You want a performant way to deal with a huge wodge of history? Write the damn thing at the application level. Even PHP is good enough for this.

Jeremy Pereira · 2022-04-23 Reply Admin

I once worked in a company that had a largish system that was implemented almost entirely in stored procedures. It was a pain in the arse to debug. I asked the product manager why it was done like that and his genuine reason was "it means we can add new features without going through change control".

2022-04-27 Reply Admin

Going to the same pain at the moment. Stored procedures everywhere, thousands and thousands of lines, half of them unused, or consisting of unreachable code. Can't have debug rights, even in dev environment because "security", it's insane. At "least" we are using liquibase to manage changes, but every change is a full overwrite of the old packages after which those changes are deleted so you have absolutely no idea what functional changes have been done. Absolutely brilliant!

2022-04-28 Reply Admin

I can't believe nobody made the obvious solution: create 288 columns, one for every 5-minute interval in a day

Optimized Database Access Patterns for Dummies

Leave a comment on “Optimized Database Access Patterns for Dummies”