• MIKE (unregistered)

    INSERT INTO WTF (id,comment) SELECT 42,'FIRST' WHERE NOT EXISTS (SELECT 1 FROM WTF WHERE id=42);

  • (nodebb)

    Presumably upsert wasn't available...

  • (nodebb)

    When I glanced at the code and saw XXX I thought it was going to substitute that with a table name from a parameter or something.

    I don't really understand how one select on an empty table could be such a performance hit though, unless there is some sort of locking conflict or something.

  • (nodebb)

    Two implicit transactions. A "great" way to increase the load on the DB.

  • John (unregistered)

    It feels like we're missing key details here. I could envision this being an issue if, say, the method were being called in a loop where you were inserting a significant number of records. Couple that with a missing index on ID so that the DB has to go RBAR for every SELECT and you've got a disaster. I suspect something like that may be the actual scenario.

  • (nodebb)

    Most databases today have upsert in some capacity (ON DUPLICATE KEY or ON CONFLICT). Also surprised there wasn't any attempt at batching the updates. Doing 10,000 individual queries is so much worse than doing 10 queries of 1,000 each.
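
    For illustration, both fixes might look like this in PostgreSQL-flavoured SQL (the syntax and the WTF table from the first comment are assumptions, not from the article):

    -- Upsert: let the database resolve duplicates instead of doing SELECT-then-INSERT,
    -- and batch many rows into one statement.
    INSERT INTO WTF (id, comment)
    VALUES (1, 'first'), (2, 'second'), (3, 'third')
    ON CONFLICT (id) DO UPDATE SET comment = EXCLUDED.comment;

    -- The MySQL spelling of the same idea:
    -- INSERT INTO WTF (id, comment) VALUES (1, 'first'), (2, 'second')
    --   ON DUPLICATE KEY UPDATE comment = VALUES(comment);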

  • The other Peter (unregistered)

    Honestly, I'd keep the whole SELECT shebang in principle, just as defensive programming in case the DELETE beforehand didn't run. But yes, batching the INSERTs to do multiple rows at once, ideally within a transaction to speed up index updates a bit, would help a lot (see the sketch below).

    Also, if it really ran in a loop, wouldn't the double re-assignment of the query variable in each iteration cost tons of memory, compared to creating 2 separate variables and just updating the parameters on those?
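
    A sketch of that batching-in-a-transaction idea (PostgreSQL-flavoured; table and column names assumed):

    BEGIN;  -- one explicit transaction instead of one implicit transaction per statement
    INSERT INTO WTF (id, comment) VALUES
        (1, 'a'),
        (2, 'b'),
        (3, 'c');  -- a few hundred to a few thousand rows per statement is typical
    COMMIT;  -- the write-ahead log is flushed once here, not once per row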

  • thunderbird89 (unregistered)

    Okay, so help me understand something. The function before this empties the table. Then this does a lookup, on the now-empty table. That should be negligible time - so why is the performance taking a hit?

  • Steve (unregistered)

    This is probably only one step in the batch update (otherwise, why call it a "batch" update?). So this function gets called many, many times during the update. As the table grows, the select costs more and more.

  • LCrawford (unregistered) in reply to John

    I agree that it would likely bog down when there is no index on ID, coupled with a large number of rows.

    One possible explanation of the style is that the source data contains duplicate IDs, or it needed to work with a non-empty table in other cases.

  • (nodebb) in reply to thunderbird89

    Okay, so help me understand something. The function before this empties the table. Then this does a lookup, on the now-empty table. That should be negligible time - so why is the performance taking a hit?

    I'm reading between the lines a bit here, but I think the key is that this is a batch updater. I'm guessing the larger context is that it purges the table, then calls this in a loop with a bunch of items.

    If that's the case, you are unnecessarily checking for dupes (unless there's dupes in your input data). You are unnecessarily creating and tearing down connections. But it's also conceptually O(n^2): the SELECT has to search through all the items in the table, so if you're inserting 1000 items, you're doing 1000 queries over a table of size 0, 1, 2, ... 998, 999. (Conceptually quadratic, at least. Databases use BTrees, so it's more like O(n log n) in reality.)
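
    To put a number on that conceptual quadratic (illustrative arithmetic, not from the article): if each of the n SELECTs scanned the k rows already inserted, the total work would be

    \sum_{k=0}^{n-1} k = \frac{n(n-1)}{2} \approx 500{,}000 \text{ row comparisons for } n = 1000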

  • ZZartin (unregistered)

    Definitely something else going on there. A SELECT + INSERT shouldn't be massively slower than a straight INSERT, especially on a table that starts out empty.

  • Your Name (unregistered)

    The real WTF is of course the BASIC fans creating the SQL dialect in the first place without a standard MERGE INSERT, MERGE UPDATE, etc.

  • MaxiTB (unregistered)

    MERGE

  • TVJohn (unregistered) in reply to WatersOfOblivion

    Also, I would hope that id is the primary key and has a clustered index on it. If, by any chance, it has no index then running a select where id = ... could be quite a slow process once the table size starts to grow.
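
    A quick way to check and fix that, assuming PostgreSQL (the index name is made up):

    -- Ensure lookups on id are indexed; a UNIQUE index also rejects duplicate ids.
    CREATE UNIQUE INDEX IF NOT EXISTS wtf_id_idx ON WTF (id);

    -- Verify: the plan should show an Index Scan, not a Seq Scan over the whole table.
    EXPLAIN SELECT * FROM WTF WHERE id = 42;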

  • Rich (unregistered)

    This doesn't look any worse than the sort of thing ORM tools throw at a database. Far from ideal, but it's using a pooled connection, and apparently only running the select on an empty table anyway - otherwise I might be bothered by the SELECT *, which would cause a lookup to the full data page rather than just the PK index which presumably exists on Id. If removing that select improves anything significantly, I'd be very suspicious of the network latency between the app and the DB!

  • Rich (unregistered)

    .. oh yes, my mistake - I just noticed the table is getting filled between selects. Getting rid of the select is still a bad idea. I'd replace the * with "Id"... actually, I'd replace the whole thing with a dump to a temp table and a merge (sketched below), but getting rid of the * is better than getting rid of the select.
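
    That temp-table-and-merge approach might look something like this (a sketch only; the staging table is made up, and MERGE support varies by database):

    -- Stage the incoming batch with one bulk load...
    CREATE TEMPORARY TABLE wtf_stage (id INT PRIMARY KEY, comment VARCHAR(255));
    INSERT INTO wtf_stage (id, comment) VALUES (1, 'a'), (2, 'b'), (3, 'c');

    -- ...then reconcile it with the target table in a single statement.
    MERGE INTO WTF AS t
    USING wtf_stage AS s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET comment = s.comment
    WHEN NOT MATCHED THEN INSERT (id, comment) VALUES (s.id, s.comment);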

  • p (unregistered) in reply to WatersOfOblivion

    Presumably the "id" field is the primary key, so isn't the lookup essentially O(1)? Is the WTF that they didn't make the ID the primary key and forgot to add an index?

  • (nodebb) in reply to p

    Also, I would hope that id is the primary key and has a clustered index on it. If, by any chance, it has no index then running a select where id = ... could be quite a slow process once the table size starts to grow

    I'd assume id is the PK, and every database I've worked with forces an index on the primary key. But indexes increase search performance at the cost of write performance, because you're technically writing twice on each insert: once to the table and once to the index. If you're doing a bunch of individual inserts, that price is going to pile up, especially since write times tend to dwarf read times. Again, assuming that we're doing this in a loop.

    Presumably the "id" field is the primary key, so isn't the lookup essentially O(1)? Is the WTF that they didn't make the ID the primary key and forgot add an index?

    In the BTrees that databases use to store data on disk, a key lookup is O(log n), where n is the number of items in the table, but the base of the log is the sector size (usually 4k) divided by the size of the primary key (say 8 bytes for a 64-bit int), so the base is usually around 512. Not constant, but low. The problem is that it's O(log n) disk seeks, which are sloooow. Writes are even worse, because you have to wait for all reads to finish and block new ones to get the lock before you can perform the O(log n) writes. And again, if you're writing a row + PK index, that's two locks and 2 * O(log n) writes.

    Of course, there's caching, which can have an O(1) lookup time. And there's MVCC. Oh, and the writes are put in a write-ahead log, so it's technically O(1) writes and (hopefully) zero disk seeks, and the O(log n) writes get deferred. It gets ... complicated.
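
    Rough numbers for that fan-out claim (illustrative, not from the article): with 4096-byte pages and 8-byte keys the branching factor is about 512, so even a billion-row table needs only

    \log_{512}(10^9) = \frac{\ln 10^9}{\ln 512} \approx \frac{20.7}{6.2} \approx 3.3

    levels - call it three or four disk seeks per uncached lookup.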

  • Bruce (unregistered)

    Interesting that no one has identified, as a contributing factor, the failure to reuse the prepared statements across iterations of the loop (assuming that is what happens), pointlessly redoing that exact same non-trivial work each time.
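
    For what it's worth, statement reuse can even be sketched in plain SQL using PostgreSQL's PREPARE/EXECUTE (the statement name and values are made up):

    -- Parse and plan the statement once, outside the loop...
    PREPARE wtf_ins (INT, VARCHAR) AS
        INSERT INTO WTF (id, comment) VALUES ($1, $2);

    -- ...then each iteration only binds parameters and executes.
    EXECUTE wtf_ins(1, 'first row');
    EXECUTE wtf_ins(2, 'second row');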
