Admin
Prmtr optzn != ^
Admin
While we all know that premature optimisation is bad, it is rarely acknowledged that premature pessimisation is even worse.
Admin
Ah yes, the old storing relational data in a non-relational way in a relational database, for Reasons.
Besides, one record per device every 5 minutes is only 288 records per device per day. Unless we're talking billions of devices, I fail to see the problem.
Admin
Currently cow-orking with a colleague who is dumping a line of debug info, once a day at around midnight, into a whole bunch of text files which are periodically harvested by a program I am developing.
Trouble is, we can't just grab copies of these files programmatically according to our monitoring schedule, because he is holding them open. We have to jump through the irksome hoop of taking a backup of each one, using a manual process that requires extra privileges.
I suggested that he just open the file for append, write his line of data, then close it again. He disagreed, on the grounds that the overhead of opening and closing a file would be a performance hit. He seems deaf to the arguments that a) if he's got the file constantly open for write, the apps that need to access that file on a regular basis can't, and b) holding files open 24/7 that you aren't doing anything with hogs resources.
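A minimal sketch of the open-for-append, write, close pattern I was suggesting (the file name and payload are made up for illustration):

```python
from datetime import datetime, timezone

def log_debug_line(path="device_debug.log", payload="heartbeat ok"):
    # Hypothetical once-a-day debug line; only the pattern matters here.
    line = f"{datetime.now(timezone.utc).isoformat()} {payload}\n"
    # Open only for the duration of the write, so other processes
    # (e.g. the harvester) can read or copy the file in between.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)

log_debug_line()
```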
Bah. Youngsters. Geroff me lawn.
Admin
If you are using a database as source control, it sounds more like you lost all control.
Admin
Maybe this is the database that tells us how many devices run Java.
Admin
I had a very similar experience in one of my previous contracts. Only the reason was different: INT overflow. You see, if you log location data for a lot of devices with high precision, and the position log has an INT32 primary key, you've got a problem. I proposed either changing it to INT64 or using DeviceID + Timestamp instead; neither proposal was accepted, but they built the very same solution mentioned in the WTF (only they thought about optimization, so they used binary records with fixed-length time/lat/lon fields). It took two people half a year to develop and test it all. The reason they chose this approach: an INT64 PK (or DeviceID + Timestamp) would consume too much storage ...
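Back-of-the-envelope arithmetic on how quickly an INT32 auto-increment key runs out; the one-million-device fleet size is an assumption for illustration, not a figure from that contract:

```python
INT32_MAX = 2**31 - 1                     # 2,147,483,647
devices = 1_000_000                       # assumed fleet size
rows_per_device_per_day = 24 * 60 // 5    # 288, one fix every 5 minutes

rows_per_day = devices * rows_per_device_per_day   # 288,000,000
days_until_overflow = INT32_MAX / rows_per_day     # roughly 7.5 days
print(f"{rows_per_day:,} rows/day -> key exhausted in {days_until_overflow:.1f} days")
```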
Admin
Missing requirement: what resolution of position fix is required! They've already scrapped altitude, which is probably a fair compromise. I'd think that using some coordinate system (UTM/OLC?) other than lat/lon would be quite useful. I'd likely have an occasional coarse position update followed by fine position updates, including some very compact way of saying "it hasn't moved," without worrying that degrees represent different absolute sizes depending on where you are on the globe. Presumably they'd also want to keep track of "no position fix available" and "data deleted," which are also very compact.
105,120 records per device per year - not that bad. If you have 1M devices (I mean, we are talking about the famous Initech, right??), you're looking at 105 billion records per year. The good news is that you can easily store them in sorted order per device, so I would probably just partition by device, which also makes it really fast/easy to: load balance by hash across a cluster O(1), recall records by device O(log n), delete records by device O(n). End-of-day/end-of-month compression is probably also easiest by device.
It's so little data per device that you wouldn't have to partition again. If I had to partition again, I'd probably do it by month, because "how many 5-minute slots since midnight" takes 9 bits (if only they had said every 6 minutes!), so I could prepend "day of month" (5 bits) without exceeding 16 bits.
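A quick sketch of the day-of-month plus 5-minute-slot packing described above; purely illustrative, not anything from the article:

```python
def pack_slot(day_of_month: int, minutes_since_midnight: int) -> int:
    slot = minutes_since_midnight // 5    # 0..287, needs 9 bits
    return (day_of_month << 9) | slot     # day 1..31 needs 5 bits -> 14 bits total

def unpack_slot(packed: int) -> tuple[int, int]:
    # Returns (day of month, minutes since midnight, rounded down to 5 minutes).
    return packed >> 9, (packed & 0x1FF) * 5

assert unpack_slot(pack_slot(20, 14 * 60 + 35)) == (20, 14 * 60 + 35)
```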
Admin
A few problems here.
1. Their choice of storing it as text, and not binary (a rough sketch of the difference follows below).
2. Placing this "condensed" storage in the same database; if I'm charitable, this belongs in some archive or reporting database, not the transactional "live" database.
3. I'm sure Oracle supports table partitioning (I don't have experience with it, but I do with MSSQL), and it's extremely effective at solving these types of challenges.
4. It's a reminder that choosing between single-tenant and multi-tenant database models is VERY important; I'm certain this shop should use a single-tenant model, which would greatly reduce the load on the database - you're essentially partitioning by client.
5. To the extent that older records can be archived, this is a prime example for file-based (or cloud blob) storage in some CSV format; it's still readily accessible, very easy to UNION with live data, the performance is similar, and database costs are greatly reduced.
Addendum 2022-04-20 10:00: PS. To those who downplay the # of rows generated by logging systems like this - I currently work for a company which has a similar situation and it quickly gets out of hand with unmaintainable databases, hitting different types of limits, etc. So yes, it's a real problem; but no, they didn't solve it correctly.
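As a rough illustration of point one (binary vs. text), here's a sketch of a fixed-length binary fix record; the 12-byte layout is an assumption, not the article's schema:

```python
import struct

# 4-byte epoch-minutes timestamp + two 4-byte floats for lat/lon = 12 bytes fixed.
# (Real code might prefer fixed-point integers, since float32 limits precision.)
RECORD = struct.Struct("<Iff")

def pack_fix(epoch_minutes: int, lat: float, lon: float) -> bytes:
    return RECORD.pack(epoch_minutes, lat, lon)

binary = pack_fix(27_500_000, 40.7128, -74.0060)
text = "27500000,40.7128,-74.0060\n".encode()
print(len(binary), len(text))   # 12 vs 26 bytes per fix
```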
Admin
That there is the thing here.
Apart from the lunacy of having a temporary table hanging around just to condense everything into a CLOB (if I have the original insane design right), I can't really see a problem with updating the location info for a single id every five minutes. Yes, it should be done in a transaction. (Big deal!) But no, it's not like updating a table with counts of a resource, such as (say) an inventory system. Presumably there will never be a case where you are flooded with updates from the same id, so the supposed "expense" of a transaction boils down to the "expense" of using a relational database -- i.e., the transaction itself.
The sad thing is, you could actually model the basic requirements off on the side with a small table, an injection system, and a stopwatch. If you did that, I'm pretty sure you could get down to updating every id in the set of ids every tenth of a second or so, and nobody would notice the supposed "load."
Funny how DB managers (I'm talking CTO types here) never ask for that very simple input.
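Something like the small-table-and-stopwatch experiment described above could look like this; it uses SQLite purely for illustration, not the Oracle setup from the article:

```python
import sqlite3, time, random

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE position_log (device_id INTEGER, ts INTEGER, lat REAL, lon REAL)")

start = time.perf_counter()
with con:  # one transaction for the whole batch of writes
    con.executemany(
        "INSERT INTO position_log VALUES (?, ?, ?, ?)",
        [(d, 0, random.uniform(-90, 90), random.uniform(-180, 180)) for d in range(10_000)],
    )
print(f"10,000 position writes in {time.perf_counter() - start:.3f}s")
con.close()
```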
Admin
Yup, I'd do this with an archiving system. Depends on requirements, of course.
However, that's basically how global credit card acquisition and reconciliation works. 99% of the time, you're only interested in the last day or two of transactions. Everything else is archived (your example: CSV) and pulled back for a JOIN.
Not only is this not rocket science: it practically predates rocket science.
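A sketch of the archive-then-pull-back idea; the table and file names are invented, and a tiny stand-in CSV is written first so the example is self-contained:

```python
import csv, sqlite3

# Stand-in for last month's exported fixes.
with open("fixes_2022_03.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["device_id", "ts", "lat", "lon"])
    w.writerow([42, 1648000000, 40.7128, -74.0060])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE live_fixes (device_id INTEGER, ts INTEGER, lat REAL, lon REAL)")
con.execute("CREATE TEMP TABLE archived_fixes (device_id INTEGER, ts INTEGER, lat REAL, lon REAL)")

# Pull the archived CSV back into a temporary table next to the live data.
with open("fixes_2022_03.csv", newline="") as f:
    rows = [(int(r["device_id"]), int(r["ts"]), float(r["lat"]), float(r["lon"]))
            for r in csv.DictReader(f)]
con.executemany("INSERT INTO archived_fixes VALUES (?, ?, ?, ?)", rows)

# Query live and archived history together for one device.
for row in con.execute("""
    SELECT device_id, ts, lat, lon FROM live_fixes WHERE device_id = ?
    UNION ALL
    SELECT device_id, ts, lat, lon FROM archived_fixes WHERE device_id = ?
    ORDER BY ts""", (42, 42)):
    print(row)
```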
Admin
@Mr. TA: OP here. You are right that even a billion rows can get out of hand in a table where the initial design didn't take scaling into account. I believe sharding was considered, but the Oracle licensing costs would have been too expensive. I do agree that having CSV for "archive" data would be a better design, but I think they wanted to keep the ability to run a single stored procedure to generate reports, etc.
@Sole Purpose of Visit: We couldn't just have a single row that gets updated; it was quite important to save the full history (and keep the past few days quickly accessible).
Admin
Sounds like ELD devices. I suppose it makes a big difference whether it's a regional carrier with maybe a couple dozen trucks or an absolute behemoth.
Admin
Fair enough.
But that's where you don't do everything in a stored procedure. Stored procedures (particularly in Oracle or MS) are a bit of a bodge.
You want a performant way to deal with a huge wodge of history? Write the damn thing at the application level. Even PHP is good enough for this.
Admin
I once worked in a company that had a largish system that was implemented almost entirely in stored procedures. It was a pain in the arse to debug. I asked the product manager why it was done like that and his genuine reason was "it means we can add new features without going through change control".
Admin
I can't believe nobody proposed the obvious solution: create 288 columns, one for every 5-minute interval in the day.