Admin
It occurs to me that nobody has mentioned the WTF in the writing (not the WTF in the subject of the writing, but in the writing itself).
The article suggests strongly that Perl is a write-only language. This assertion is ridiculous. Anybody with an ounce of knowledge knows that while Perl isn't easy to read, its illegibility is a mere grain of sand in the path of knowledge when compared to the Himalayan heights of the illegibility of APL.
Of course, APL code is all Greek to me... (APL's name is also the lamest TLA I've encountered so far.)
Admin
http://en.wikipedia.org/wiki/APL_%28programming_language%29
Really? Could they honestly not come up with anything better than that?
Admin
TBH, that query sucks. You need to level up in SQL a couple more times. It's written with a procedural-programming mindset, and the performance is suboptimal.
The problem?
You get my point? The SQL engine has to disentangle that query by executing each of the sub-selects in sequential order, innermost first. I don't think the row estimator works that well with that pattern either. (It may be more intelligent nowadays; I only have experience with Postgres 8.4.)
You're supposed to write each of those as joins. If it's a 1:n relation, it might be tricky, but n:1 is trivial. Something like this:
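A rough sketch of what I mean, guessing at the forum's schema, so every table, column, and group name here is an assumption:

```sql
-- Instead of the nested IN ( SELECT ... IN ( SELECT ... ) ) chain,
-- express the same relationships as joins and filter at the end.
SELECT p.id, p.raw
FROM posts p
JOIN group_users gu ON gu.user_id = p.user_id
JOIN groups g       ON g.id = gu.group_id
WHERE g.name = 'trust_level_1'   -- hypothetical group filter
  AND p.raw LIKE '%UUID%';       -- the expensive text match stays in one place
```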
Of course, that's likely to give extra/duplicate-ish rows if we're dealing with 1:n relations, since joins multiply rows. Whether that's a problem, I don't know. It depends on the code.
But anyway, the query looks a lot nicer to me when written as joins.
Filed under: Armchair SQL non-expert, I'm probably in the wrong thread for this anyway
Admin
And then in computer studies, the teacher asks someone to evaluate
0.0 × log(sin(0)).
Admin
Uh-huh. Let me know how that works out for you.
Meanwhile, with a real dataset, your suggested query runs slower than mine - not that the actual time taken is a problem, since it's less than a second (and the entire query takes less than 2 seconds) for a query that's run at most once a minute...
Your version first, followed by mine:
And a repeat of both in case you think caching might be an issue:
Admin
Can I gets an `EXPLAIN ANALYZE` for both queries?
Admin
Was working on it when you posted... :laughing:
Admin
That is curious. Three seq scans. I expected only one. You sure you aren't missing an index on `group_users (user_id)` or `group_users (group_id)`? You seem to have a composite index on `group_users (group_id, user_id)`, which can't be used for `user_id`.
Yours is using that composite index, mine isn't.
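If one of those really is missing, the fix would be something along these lines (index names invented; I obviously can't see your schema):

```sql
-- Hypothetical single-column indexes, so look-ups by either column alone
-- can use an index instead of falling back to a seq scan.
CREATE INDEX idx_group_users_user_id  ON group_users (user_id);
CREATE INDEX idx_group_users_group_id ON group_users (group_id);
```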
Anyway, this is mostly academic since the `LIKE '%UUID%'` is the real I/O-eater in those queries.
Admin
I'm not missing anything - this isn't my database :laughing: - it's the one this forum runs on and I have no control over the schema. But since you ask..
It surprised me how fast that bit of the query ran when I first started experimenting with the concept.
Admin
Continuing with the pointlessly academic - I added two indices for the columns you mentioned (I'm running off a restored backup, not the live DB). I'm not convinced it has helped:
Admin
This kinda makes me want to try and rewrite some of the queries in my dev DB with subqueries rather than joins, just to see if there's a noticeable difference.
Admin
How long do they take to run? Sub-second? Seconds? Minutes?
Admin
The plan confuses me. `posts` is a massive table, right? Postgres should be intelligent enough to only scan the sixteen posts it finds via the join (granted, it thinks that it'll find 226 posts); instead it's seq scanning the whole table.
The `groups_users` and `groups` tables seem pretty small (`gu` having a few thousand rows, `groups` fewer than a hundred?), neither really benefiting from an index.
Is it missing an index on `posts (user_id)` then? That's the only reason I can think of right now.
Admin
Sub-second. Not really looking for a performance boost at this point, I'm just curious.
Admin
Nope - there's one there...
Admin
There's your issue. It's a partial index. Pg can't use that.
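The reason, sketched with guessed names (I'm assuming the partial index only covers non-deleted posts, and the `raw` column and the literal values are stand-ins):

```sql
-- Hypothetical recreation of the index in question: it only covers non-deleted rows.
CREATE INDEX index_posts_on_user_id ON posts (user_id) WHERE deleted_at IS NULL;

-- Nothing in this WHERE clause implies deleted_at IS NULL, so the planner
-- can't prove the partial index contains every row the query might need:
SELECT id FROM posts WHERE user_id = 20 AND raw LIKE '%UUID%';

-- Repeat the index's predicate and the index becomes usable:
SELECT id FROM posts WHERE user_id = 20 AND raw LIKE '%UUID%' AND deleted_at IS NULL;
```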
Try adding `p.deleted_at IS NULL` to both queries next to the `LIKE` and see what happens.
Admin
groups_users contains stuff like UserX is TL0, TL1 and TL2. For example, for me:
Admin
Admin
Seriously, what the hell is wrong with that query, then? It's still not using the index, even though I'm sure it should. Once it does, the runtime should be well under 50 milliseconds.
If you're willing to make the `posts (user_id)` index non-partial on your test server, try it.
You probably shouldn't even be making partial indices unless more than, say, 50 per cent of your rows are excluded by the condition. (I sure hope the users on this forum aren't deleting half of their posts.)
Admin
I'm presuming there's no issue with having both - I didn't remove the other one..
4.5%. 1 in every 23 posts still seems excessive...
Admin
Most of them are probably Paula every morning. But, yeah...that's more than I'd expect these days.
Admin
Hmm - I don't remember deleting so many...
Admin
:wtf:
Admin
No, shouldn't be an issue.
This is beyond me. I have no further ideas. You can perhaps try the pgsql-performance mailing list if you're interested. I'd be very interested in knowing why it's seqscanning the posts table even though it has everything it needs not to.
There's always the nuclear option of `SET enable_seqscan TO off;`, which disables it for the session. Shirley the EXPLAINs should then use the index?
Admin
Well, as has been pointed out, for a whole query that takes < 2 seconds and runs at most once a minute, the whole issue is academic...
I'm not too worried about it.
Admin
I noticed you left the `p.deleted_at IS NULL` in the queries after you created the non-partial index; perhaps it's affecting the plan.
Admin
Not a lot of difference - timings are comparable:
Admin
Try the `SET enable_seqscan TO off` and `EXPLAIN` the queries. If it's still not using the index on `posts`, there's something I'm missing.
Admin
Well, it's halved the time taken for your suggested query (213.444 ms), and done even slightly better for mine (202.261 ms).
I won't pretend to even remotely understand the query plans - reading up on that now...
Admin
That would be your cue to put regular expressions away and get a real parser.
That's just one more reason to stay away from Facebook.
Admin
0
2 4 6
4 8 16
8 10 24
those three numbers follow a pattern rule. what is the rule?
Admin
They look okay now. The posts table part of the query is now using the index, as it should.
Postgres thinks the nested loop variant takes 100k pagefetches [1] while the seqscan on `posts` takes only 24k. But it's not telling me why: the cost estimate just jumps to 100k; its direct sub-items have costs of 286-ish and 441-ish...
Why is it using the nested loop strategy? Postgres usually prefers hash joins, doesn't it?
(Answering myself: maybe there are too few rows in the intermediate resultset for a hash join to be beneficial, or maybe the resultset isn't wide enough. You can disable the relevant `enable_*` setting, `enable_nestloop` in this case, and see what the planner thinks of its backup plan and how much it costs...)
Anyway, the thing I was missing is now obvious, I think: you're searching all posts by yourself. Postgres says this will be [the amount of rows returned from g_u] * [average amount of posts per user], as it thinks it'll have to find 241 users' information (in reality: 16 ids, which I think are all the same `user_id = 20`?). So 241 users * avg_postcount means it would have to do the `posts (user_id)` index look-up for a sizeable percentage of the rows...
On second thought, that can't be it. It's looking up fewer than a thousand rows - half a per cent of the whole table. Perfectly reasonable to use an index then. So scrap this.
[1]: Well, not really 'pagefetches'; it's measured on an arbitrary scale used by the planner. 'Planner unit cost' perhaps?
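For the `enable_*` trick mentioned above, the pattern is roughly this (session-local, and the query itself is just a stand-in):

```sql
-- Forbid the nested-loop strategy for this session only, so EXPLAIN shows
-- what the planner considers the next-best plan and what it thinks it costs.
SET enable_nestloop TO off;

EXPLAIN ANALYZE
SELECT p.id
FROM posts p
WHERE p.user_id = 20
  AND p.raw LIKE '%UUID%'
  AND p.deleted_at IS NULL;

-- Put the setting back to its default afterwards.
RESET enable_nestloop;
```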
Admin
Each row is the first number times the column number, but the middle column is being expressed in hexadecimal instead of decimal.
:stuck_out_tongue:
Admin
........ ok that does match, but no.....
also i really shouldn't play guessing games here so......
https://www.youtube.com/watch?v=vKA4w2O61Xo
Admin
Can't watch it from work as I don't have headphones. I'll have to watch it later.
Although I will say I tried a few different things when doing this, but either the 10 or the 16/24 seemed to block the various things I tried. So I figured I'd go with the tongue-in-cheek guess. :smiley:
Admin
It was for this experiment: http://what.thedailywtf.com/t/high-unread-count/3597
@SignatureGuy was able to successfully generate around 3000 posts but was only able to remove 150 of them before running into some DiscoLimits.
Admin
I agree with whatever @VinDuv posted just above.
Admin
You posted them.
For the database discussion, isn't this why you can move posts?
Admin
I can move posts?
News to me!
Admin
heh
Admin
We did, but then the idiocracy dragged us back again.
Admin
I would wonder if it's the join order that's confusing the query engine. The nested IN statements kind of control the JOIN order, and the subquery WHERE clauses act like JOIN conditions, while the alternative query does all the JOINs and then filters with the WHERE clause.
Or even:
I also wonder how this performs, as EXISTS can outperform IN, although that usually only happens when the subquery is very large:
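Something like this, again guessing at the schema, so every name in it is an assumption:

```sql
-- EXISTS variant: a correlated subquery instead of IN lists or explicit joins.
SELECT p.id
FROM posts p
WHERE p.raw LIKE '%UUID%'
  AND p.deleted_at IS NULL
  AND EXISTS (
        SELECT 1
        FROM group_users gu
        JOIN groups g ON g.id = gu.group_id
        WHERE gu.user_id = p.user_id
          AND g.name = 'trust_level_1'   -- hypothetical group filter
      );
```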
Theoretically, of course, all these queries should have the same query plan if the query planner is smart enough.
Admin
Regexes are worth taking the time to understand. It's not some innate ability that people who read them have; it's simply a matter of taking the time to learn them.
Yes, they can be abused (they really shouldn't be hundreds of characters long), but if a regex fits on a line, any professional should be able to figure it out.
I suppose I translate things in my mind too much, but I see "COMPLICATED BAD. GROG NOT UNDERSTAND. GROG ANGRY." in your post.
Admin
Maybe they are; I'm not denying it. But the point is, they wouldn't take nearly as much time to understand if somebody hadn't so seriously, badly, absolutely fucked up the syntax.
Complicated is bad. If you can have the same thing, but simpler, you'd obviously choose that over the more complicated solution.
Admin
I agree. I have always avoided regexes if there was a simpler, more understandable way of doing things. But then I have never felt the need to obscure my code as an exercise in job protection.
One of my least favorite programmers used to use regexes to validate numeric input dates. In Java. When it was pointed out to him that February has 28 or 29 days depending on the year, he eventually produced an eye-watering regex to handle it (it will fail in 2100). Mind you, he was also one of those fedora-wearing people who believe that Exceptions are bad, women can't program, and customers are always wrong.
Admin
They're good for what they're designed to do: describing matchers for Regular Languages. It's when people use them to do other things that the WTFs mount up.
Thus, if you were building a date parser from scratch, you'd use REs as one of the first steps, to extract the interesting fields (tokenizers are big RE use cases). It wouldn't tell you whether the date was valid, but it would let you get rid of a lot of obviously invalid stuff rapidly. Of course, most languages have a library (possibly built-in) that handles all that stuff for you and it would indicate a WTF to not use it — though which WTF it indicates isn't always simple — but that library call may well have been designed to use REs internally.
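To make that concrete with the database we've already been poking at (the literals are invented for the example):

```sql
-- A regex happily tokenizes a date that doesn't exist...
SELECT regexp_matches('2100-02-29', '^(\d{4})-(\d{2})-(\d{2})$');

-- ...while the built-in date parser knows 2100 isn't a leap year:
SELECT '2100-02-29'::date;  -- ERROR: date/time field value out of range
```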
Admin
Exactly. And that library has been produced by more than one person, thoroughly reviewed and tested. So this is one wheel nobody should be reinventing to show off their 1337 regex skills. (Deliberate use of obsolescent slang for emphasis.)
Admin
FTFY 0.45% (inb4 WTF is ‰)
Admin
Well, 2 out of 3 correct is not that bad. </troll> <Also I'll let you guess my thoughts on which one of those three is wrong.>
Admin
It's a memorial to the brave souls who attempt to clean up this purposefully obfuscated mess.