The Daily WTF: Curious Perversions in Information Technology

2006-02-01 Reply Admin

select col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4;

In Belgium, a C4 is -besides a Citroen- a document to request unemployment. Seems right in place here!

mcrbids · 2006-02-01 Reply Admin

I'm working on an abstration layer that allows for direct querying of the database for information on foreign keys.

Thus, let's say that you have the following tables:

table customers
id serial
name varchar,
address varchar,
... etc...

table invoices
id serial,
customers_id integer REFERENCES customers(id),
date date,
... etc ...

table purchased
invoices_id integer REFERENCES invoices(id),
item_id integer REFERENCES items(id),
... etc ...

By merely specifying "purchased" and "customer" a query would be built, following the foreign keys, so that you have a distinct list of customers and the items they purchased.

This would eliminate cartesian joins in table sets properly constrained by foreign keys. This is being done by querying the database metainformation tables directly, and should work for Oracle/Postgres/whatever. (tested in Postgres 8.1)

Since I've not seen this elsewhere, I'd be very curious what you cynical DailyWTF''ers might have to say about this. (naysaying will be ignored; comparisons to existing technology will be carefully considered)

2006-02-01 Reply Admin

"By the way, Sybase really needs the SQLServer "join" syntax so where clauses can be less WTF-ery in and of themselves. "

More recent versions of Sybase (certainly from 12.5 onwards, possibly eralirer, can't remember) do indeed allow ANSI JOIN syntax.

2006-02-01 Reply Admin

I think you need ORACLE ;-)

2006-02-01 Reply Admin

how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps

So what's the difference?

What does "a real join" give you that the where clause that joins the two tables doesn't?

Aside from being verbose to the point of being cobolesque, and not backwards compatible, that is.

jwenting · 2006-02-01 Reply Admin

Anonymous:

select col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4;

In Belgium, a C4 is -besides a Citroen- a document to request unemployment. Seems right in place here!

And in all of NATO it's a kinda powerful plastic explosive, ideal for blowing up servers ;)

jwenting · 2006-02-01 Reply Admin

Ah, "performance" upgrade through adding hardware.
Once was asked to profile a C application that needed to perform some calculations based on an input file.
This application was meant to run once a day. Initially this had been no problem because the input wasn't that large.

Over time the input increased to the point that the application now needed 36 hours to complete its 24 hour mission, and some alarm bells started ringing.

On diving into the source I found an application written by someone who'd obviously heard the rule to replace everything that's done more than once in a program by a function.

So was there the function "int add(int a, int b) {return a+b;}" which was of course not inlined.
They'd also heard about loops and how they make code better. What they'd not heard about is that looping from 0 to 1 makes the program slower. So there were dozens of nested loops, all running over extremely short ranges.

Unrolling a few of those inner loops and inlining those dumb functions was enough to increase the performance by over 50%, at which point the customer was happy.

2006-02-01 Reply Admin

Anonymous:
BlackTigerX:
Anonymous:

I've seen this type of stuff before.

This is simplifed of course.

select col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4;

Off course they got back duplicate rows, so the programmer changed it to be

select unique col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4;

All was fine when system was very small, but as data got added the report that had this query ran slower, and slower, and slower. This is how the SQL was fixed.

select col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4 and c.c4 = d.c4;

aaaaarrrrghhh... the googles!!! they do nothing!!

why is people using "where" to join?

Because you cannot use inner joins (or outer joins for that matter) with some DBs. Oracle 8 anyone?

You also have databases like MySql which perform much better when you use "where" instead of "join".

2006-02-01 Reply Admin

Anonymous:

morrons!

I think my irony gland just exploded.

Jeff S · 2006-02-01 Reply Admin

Maurits:
Jeff S:
I hope you can see that those are at least 2 separate reports, right? Either you are returning all things, and which container they happen to be in (1 + 3) or you are returning all containers, including any things that might be in them (2 + 3). No logical result combines all 3 ! Surely you can see this?

That seems to be one of your postulates, that all "logical results" have a single core table which drives the query, and that the other joined tables are satellites.

Operating from this postulate, your proposition (that RIGHT OUTER JOINS and FULL OUTER JOINS are symptoms of a mistake) does, in fact, follow.

My "pshaw" then is directed not at the arguments you use to go from your postulate to your proposition, but rather at the postulate itself.

Nope. I never said anything about 1 single table driving the query. I mentioned two other great substitutes for full joins earlier on (hint: neither requires 1 primary table), and I also stated my "postulate": returning logical, non-randomly-null results from an at least somewhat normalized schema.

Jeff S · 2006-02-01 Reply Admin

Anonymous:
how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps

So what's the difference?

What does "a real join" give you that the where clause that joins the two tables doesn't?

Aside from being verbose to the point of being cobolesque, and not backwards compatible, that is.

"real joins" give you:

1) a much clearer, more logical SELECT with
a) a clear distinction between the join condition and the criteria
b) a clear, concise FROM clause to indicate your driving recordset (note I did not say "table") which is the most important part of any SELECT with joins

2) The ability to clearly express OUTER joins, and join order precendence (though this is rarely necessary, it can be handy)

3) The ability to add criteria to the join expression itself (technically only needed for outer joins, but it can make inner joins more clear as well):

SELECT ..
FROM A
LEFT OUTER JOIN B
ON A.ID = B.ID AND B.Status = "Active"

4) the ability to see explicitly when a CROSS JOIN is being done on purpose, as opposed to leaving yourself to manually read the entire WHERE clause to check to see if all tables are joined properly and wondering if the cartesian product is intential or not.

5) "Backwards-compatable"? uh, I guess not. you are correct. that is why I never use C#, because it is not backwards compatable with C, or CSS because it is not backwards compatable with netscape 2.0.

Finally,

6) It is ANSI standard to use join syntax. that means it will be FORWARD compatable. I will let you decide for yourself which is more important, to be forward-compatable or backwards compatable.

2006-02-01 Reply Admin

In contrast to the oracle style (+) outer joins, left outer joins allow you to join tables using compound keys:

SELECT * 
FROM a LEFT JOIN b ON a.key1 = b.key1 AND a.key2 = b.key2

is not equivalent to

SELECT * 
FROM a, b 
WHERE a.key1 = b.key1(+) AND a.key2 = b.key2(+)

johnl · 2006-02-01 Reply Admin

I read "consultant"'s (the non-registered guy who just called himself "consultant" at the top of page 2) post with interest. I've always said that software development is split into two broad segments: the business side (selling your services/products to a customer, understanding what the customer wants, etc) and the technical side (all development work, including feasibility studies and design as well as the actual programming, testing and other related tasks). The reason consultants so often fail is because they try to do both at once.

In particular, consultant said that the most important thing about consulting is not the technical skills, it's the business side - working out what the customer wants, and then selling it to him. The problem here is that there are considerable technical skills involved in working out whether what the customer wants is feasible, how long it'll take, what the costs are and so on. In this case, the consultant (the one in the WTF) had an idea for fast, simple reports to be run every night, and promptly promised that to the customers. However, he lacked the technical skills to realise that his idea wouldn't work, or perhaps he did realise, but only after he'd made the promise.

We've had similar issues where I work. Sales used to be allowed (encouraged, even) to offer extra features not in the main product to customers if it would gain a large sale, and there were a lot of times when development would turn around and say "we can't do that" or "that's 6 months work, and you've promised it for next week". Of course, a similar thing can work across programming disciplines. Our Delphi developers would assume that the installation would do xyz, and would be surprised when it didn't.

ammoQ · 2006-02-01 Reply Admin

Jeff S:
Anonymous:
how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps

So what's the difference?

What does "a real join" give you that the where clause that joins the two tables doesn't?

Aside from being verbose to the point of being cobolesque, and not backwards compatible, that is.

"real joins" give you:

1) a much clearer, more logical SELECT with
a) a clear distinction between the join condition and the criteria
b) a clear, concise FROM clause to indicate your driving recordset (note I did not say "table") which is the most important part of any SELECT with joins

2) The ability to clearly express OUTER joins, and join order precendence (though this is rarely necessary, it can be handy)

3) The ability to add criteria to the join expression itself (technically only needed for outer joins, but it can make inner joins more clear as well):

SELECT ..
FROM A
LEFT OUTER JOIN B
ON A.ID = B.ID AND B.Status = "Active"

4) the ability to see explicitly when a CROSS JOIN is being done on purpose, as opposed to leaving yourself to manually read the entire WHERE clause to check to see if all tables are joined properly and wondering if the cartesian product is intential or not.

5) "Backwards-compatable"? uh, I guess not. you are correct. that is why I never use C#, because it is not backwards compatable with C, or CSS because it is not backwards compatable with netscape 2.0.

Finally,

6) It is ANSI standard to use join syntax. that means it will be FORWARD compatable. I will let you decide for yourself which is more important, to be forward-compatable or backwards compatable.

IMO it's mostly a matter of habits.
As a long-time Oracle users, the "where"-style join is what my coworkers and I are used to.
That said, (+) for outer joins definitely sucks.

JohnO · 2006-02-01 Reply Admin

Jeff S:

JohnO:

Simplified from real-life but captures the essence:

CREATE TABLE Margin (

SnapshotDate datetime not null,

FlowDate datetime not null,

TransactionID int not null,

Amount money not null

)

ALTER TABLE Margin ADD

CONSTRAINT PK_Margin PRIMARY KEY CLUSTERED (SnapshotDate, FlowDate, TransactionID)

GO

DECLARE @BaseLine datetime,

@Current datetime

SELECT @BaseLine = '01/30/2006'

SELECT @Current = '01/31/2006'

SELECT SnapshotDate = COALESCE(BL.SnapshotDate, CT.SnapshotDate),

FlowDate = COALESCE(BL.FlowDate, CT.FlowDate),

TransactionID = COALESCE(BL.TransactionID, CT.TransactionID),

Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)

FROM ( SELECT * FROM Margin WHERE SnapshotDate= @BaseLine )BL

FULL OUTER JOIN ( SELECT * FROM Margin WHERE SnapshotDate= @Current ) CT

ON CT.SnapshotDate = BL.SnapshotDate

AND CT.FlowDate = BL.FlowDate

AND CT.TransactionID = BL.TransactionID



Can you hook me up with some sample data (INSERT statements if possible) and results based on that data? I cannot make sense of the business logic you are trying to apply with this statement. My first instinct is that no joins at all are needed, just a simple GROUP BY, but it is hard to tell what you are trying to do.

(by the way-- that's another bad attribute of FULL OUTER JOINS -- they make no sense when you try to read the SELECT statement as well. If it turns out that what you are trying to return is actually logical, the correct statement will be much shorter and much more clear in addition to being more efficient).

I am not saying there aren't equivalent ways to do this with temp tables and left joins but a decent query optimizer like SQL 2000 has no problem with this, especially when you have an index to support the join. The query plan is quite good. I don't why you have this unexplained bias against FOJs. FOJ is a natural fit for any variance situation. Apparently you've never encountered variance requirements before or you have failed to grasp the utility of FOJ.

The business logic here is straightforward. You have commodities contracts that are valued over time. Every day, you take a snapshot of their value at a fixed point in time. You want to see why your P&L is up or down from yesterday to today for a given delivery date so you have a variance report that compares two snapshots. Between yesterday and today, I could have made some new purchases or sales, I could have deactivated some existing transactions, or I could have modified the term of some existing transactions because they were entered incorrectly. Hence the need for an FOJ as opposed to a LOJ or an IJ.

There's no group by (at least not a straight-forward one) that achieves this. Perhaps you failed to see the substraction occurring on the amount column?

Another case where I've seen an FOJ used intelligently is for an auditing report that verifies that data has been transferred between two systems via some sort of interface. Things could be in System A and not in System B, things could not be in System A and in System B or they can be in System A and in System B. I am not sure why are a smart guy such as yourself can't see the applicability of an FOJ to these cases. They are real-world situations in which FOJ is a good choice.

I have yet to see you provide any real counter-argument to the use of FOJ in my situation.

Pope · 2006-02-01 Reply Admin

Anonymous:
John Smallberries:
If René Descartes was never born, this would not have been an issue.
Yeah, not to mention that "I think, therefore I am" crap and mind/body dualism. Screw you Descartes.

Fundamentally wrong anyway. Should be:

I think therefore I think I am.

Cogito ergo cogito sum.

SeekerDarksteel · 2006-02-01 Reply Admin

felix:
SeekerDarksteel:

Bah, real men query every table separately then piece it all together in the code!

If that's supposed to be sarcasm, it's misplaced. When such an approach brings execution time down from 15 min. to 10 sec. (that's 90 times!), I'd say it's a perfectly appropriate one.

Cheers,
Felix

P.S. Yes, I speak from experience. No, I'm not very good with SQL. I'm not ashamed of it. I tend to get jobs done - I think it's the only thing that matters.

I intended it to be funny, but that's what I do too. Ive gotten some processes down from 5+ minutes to 30 seconds by making 2 or 3 simple queries instead of 1 complex one. But of course our webserver is probably about 10 times as powerful as our db server so its not unexpected.

2006-02-01 Reply Admin

so, apparently some consultant didn't fully exist

Reminds me of Mike O'Dell's:
I never worry about a factor of 2 performance. Unfortunately, neither do 10 of my friends.

Jeff S · 2006-02-01 Reply Admin

JohnO:
I am not saying there aren't equivalent ways to do this with temp tables and left joins but a decent query optimizer like SQL 2000 has no problem with this, especially when you have an index to support the join. The query plan is quite good. I don't why you have this unexplained bias against FOJs. FOJ is a natural fit for any variance situation. Apparently you've never encountered variance requirements before or you have failed to grasp the utility of FOJ.

The business logic here is straightforward. You have commodities contracts that are valued over time. Every day, you take a snapshot of their value at a fixed point in time. You want to see why your P&L is up or down from yesterday to today for a given delivery date so you have a variance report that compares two snapshots. Between yesterday and today, I could have made some new purchases or sales, I could have deactivated some existing transactions, or I could have modified the term of some existing transactions because they were entered incorrectly. Hence the need for an FOJ as opposed to a LOJ or an IJ.

There's no group by (at least not a straight-forward one) that achieves this. Perhaps you failed to see the substraction occurring on the amount column?

Another case where I've seen an FOJ used intelligently is for an auditing report that verifies that data has been transferred between two systems via some sort of interface. Things could be in System A and not in System B, things could not be in System A and in System B or they can be in System A and in System B. I am not sure why are a smart guy such as yourself can't see the applicability of an FOJ to these cases. They are real-world situations in which FOJ is a good choice.

I have yet to see you provide any real counter-argument to the use of FOJ in my situation.

Sample data and results, please! I don't recall mentioning temp tables. yes, I did see the substraction. I guess I will just have to figure out how to subtract two numbers w/o using a full outer join, let me take care of that.

Jeff S · 2006-02-01 Reply Admin

CREATE TABLE Margin (
SnapshotDate datetime not null,
FlowDate datetime not null,
TransactionID int not null,
Amount money not null


ALTER TABLE Margin ADD

CONSTRAINT PK_Margin PRIMARY KEY CLUSTERED (SnapshotDate, FlowDate, TransactionID)

GO


insert into Margin
select '1/30/2006','1/1/2006',1,10 union all
select '1/30/2006','1/2/2006',2,20 union all
select '1/30/2006','1/3/2006',3,30 union all
select '1/31/2006','1/1/2006',1,10 union all  -- same
select '1/31/2006','1/2/2006',2,21 union all  -- different amount
select '1/31/2006','1/4/2006',3,30 -- different date


DECLARE @BaseLine datetime,
@Current datetime


SELECT @BaseLine = '01/30/2006'
SELECT @Current = '01/31/2006'

SELECT SnapshotDate = COALESCE(BL.SnapshotDate, CT.SnapshotDate),
FlowDate = COALESCE(BL.FlowDate, CT.FlowDate),
TransactionID = COALESCE(BL.TransactionID, CT.TransactionID),
Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)
FROM ( SELECT * FROM Margin WHERE SnapshotDate= @BaseLine )BL
FULL OUTER JOIN ( SELECT * FROM Margin WHERE SnapshotDate= @Current ) CT
ON CT.SnapshotDate = BL.SnapshotDate
AND CT.FlowDate = BL.FlowDate
AND CT.TransactionID = BL.TransactionID

I gave you some sample data and ran your SELECT. it just returns everything. Did you even try it? that's why I need sample data from YOU and what your expected results should be, since if what you have posted doesn't work, and you won't explain the logic that it *should* be doing, then I am sure how I can help you.

Is this what you meant?

select @baseLine as BaseLine, @Current as [Current], 
    FlowDate, TransactionID,
    sum(case when SnapShotDate = @BaseLine then -1 else 1 end) as Amount
from
   Margin
where
   SnapShotDate in (@BaseLine, @Current)
group by
   FlowDate, transactionID

2006-02-01 Reply Admin

Of course, you could have just changed compilers, too. Most modern ones would do this sort of inlining and loop optimization for you.

e.thermal · 2006-02-01 Reply Admin

Actually today a co-worker of mine, got a call at 2am and worked til 4am because the month end programs were running really slow on some serious hardware. These jobs were running on iSeries hardware so he was able to get a plan on the running job and found that the current plan for the one sql statement was 2000 seconds to execute. What this job did was delete notes from a central notes file so this program got called by about 15 other programs and is intended to execute quickly. No one ever optimized the sql, so my co-worker ripped the sql out and re-wrote it in rpg and re-ran the mothend procedure and the whole thing took 57 seconds to execute, compared to 3+ hours last night.

Ohh BTW this is code from the vendor in a tier 1 ERP solution.

johnl · 2006-02-01 Reply Admin

Unless he ws using a modern compiler, but for whatever reason it just couldn't spot the potential for optimisation. Most compilers will favour predictability over performance, so will only optimise if they're absolutely sure that it won't break anything.

2006-02-01 Reply Admin

I think that there is a typo in your query. You wrote:

select @baseLine as BaseLine, @Current as [Current],
 FlowDate, TransactionID,
 sum(case when SnapShotDate = @BaseLine then -1 else 1 end) as Amount
from
 Margin
where
 SnapShotDate in (@BaseLine, @Current)
group by
FlowDate, transactionID

And I think you meant:

select @baseLine as BaseLine,
 @Current as [Current],
 FlowDate,
 TransactionID,
 sum(Amount*case when SnapShotDate = @BaseLine then -1 else 1 end) as Amount
from Margin
where SnapShotDate in (@BaseLine, @Current)
group by FlowDate, transactionID

Other than that, looks like it should be much more efficent.

Jeff S · 2006-02-01 Reply Admin

Anonymous:
I think that there is a typo in your query. You wrote:

select @baseLine as BaseLine, @Current as [Current],
 FlowDate, TransactionID,
 sum(case when SnapShotDate = @BaseLine then -1 else 1 end) as Amount
from
 Margin
where
 SnapShotDate in (@BaseLine, @Current)
group by
FlowDate, transactionID

And I think you meant:

select @baseLine as BaseLine,
 @Current as [Current],
 FlowDate,
 TransactionID,
 sum(Amount*case when SnapShotDate = @BaseLine then -1 else 1 end) as Amount
from Margin
where SnapShotDate in (@BaseLine, @Current)
group by FlowDate, transactionID

Other than that, looks like it should be much more efficent.

d'oh! thank you for noticing that. He also *might* want this:

select FlowDate, TransactionID, 
    sum(case when SnapShotDate=@BaseLine then Amount else 0 end) as BaseLine,
    sum(case when SnapShotDate=@Current then Amount else 0 end) as [Current],
    sum(Amount * case when SnapShotDate = @BaseLine then -1 else 1 end) as Difference
from
   Margin
where
   SnapShotDate between @BaseLine and @Current
group by FlowDate, transactionID
order by FlowDate, transactionID

Again, though, hard to tell. I suppose, then, his corrected SELECT would be:

SELECT SnapshotDate = COALESCE(BL.SnapshotDate, CT.SnapshotDate),
FlowDate = COALESCE(BL.FlowDate, CT.FlowDate),
TransactionID = COALESCE(BL.TransactionID, CT.TransactionID),
Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)
FROM ( SELECT * FROM Margin WHERE SnapshotDate= @BaseLine )BL
FULL OUTER JOIN ( SELECT * FROM Margin WHERE SnapshotDate= @Current ) CT
 ON CT.FlowDate = BL.FlowDate
AND CT.TransactionID = BL.TransactionID
order by 2,3

(note: I removed the join on the SnapShotDate columns, and I added the order by, since I hope we are not returning random results)

That is twice as slow as the simple GROUP BY version and quite a bit longer and uglier to boot.

But, again, we are still guessing as to what his FULL JOIN is supposed to be doing.

JohnO -- it's OK to learn something now and then. Sometimes, when someone says "there are better ways", instead of instantly thinking they are stupid and trying to prove them wrong, you should sometimes consider listening to them.

2006-02-01 Reply Admin

Or maybe not... :(

I just checked, and the select without a join both returns an incorrect result set (only returns 19404 rows, should return 38808) and is slower, at least with the sample data that I made up.

In any case this query:

select SnapshotDate,
 Flowdate,
 TransactionID,
 sum(case when [Type] = 'Current' then amount else 0 end) -
 sum(case when [Type] = 'Baseline' then amount else 0 end) as [Amount]
from
(
 select 'Baseline' as [Type],
 *
 from Margin
 where SnapshotDate = @Baseline
union all
 select 'Current' as [Type],
 *
 from Margin
 where SnapshotDate = @Current
) a
group by SnapshotDate,
 Flowdate,
 TransactionID

Produces the correct result set and is faster than the original query. Note that the estimated execution plan shows that it is more expensive than the original query, but server trace results show a shorter execution duration of this query versus the original query. If you are curious, I generated my sample data with the following code:

declare @SnapshotDate datetime,
@FlowDate datetime,
@TransactionID int,
@Amount money

set @SnapshotDate = '1/1/05'
set @Amount = 10000.00

while (@SnapshotDate < '2/1/06')
begin
set @Flowdate = '1/1/05'
while (@Flowdate < '2/1/06')
begin
 set @TransactionID = 1
 while (@TransactionID < 50)
 begin
 insert Margin values (@SnapshotDate, @Flowdate, @TransactionID, @Amount * rand() )
 set @TransactionID = @TransactionID + 1
 end
 set @Flowdate = @Flowdate + 1
end
set @SnapshotDate = @SnapshotDate + 1
end

Giving me a nice large table to work against.

Jeff S · 2006-02-01 Reply Admin

Anonymous:
Or maybe not... :(

I just checked, and the select without a join both returns an incorrect result set (only returns 19404 rows, should return 38808) and is slower, at least with the sample data that I made up.

Actually, I don't think so ... it should return 19404 rows. You should not be grouping on SnapShotDate. The idea (as we are guessing) is to return 1 row per TransactionID/FlowDate and compare the base amount with the current amount.

Look at your results that return 38808 rows -- does it ever compare baseline to current?

JohnO · 2006-02-01 Reply Admin

Use this sample data:

INSERT Margin VALUES ('01/30/2006', '02/01/2006', 1, 50000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 2, 150000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 3, 200000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 4, 600)

INSERT Margin VALUES ('01/31/2006', '02/01/2006', 1, 40000)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 3, 210000)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 4, 700)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 5, 1000)

And remove the snapshotdate clause from the ON, IE it should now be

...CT ON CT.FlowDate = BL.FlowDate
AND CT.TransactionID = BL.TransactionID

Sorry for the bug. I was typing it from memory.

Expected results are:

2006-01-30 00:00:00.000 2006-02-01 00:00:00.000 1 -10000.0000
2006-01-30 00:00:00.000 2006-02-01 00:00:00.000 2 -150000.0000
2006-01-30 00:00:00.000 2006-02-01 00:00:00.000 3 10000.0000
2006-01-30 00:00:00.000 2006-02-01 00:00:00.000 4 100.0000
2006-01-31 00:00:00.000 2006-02-01 00:00:00.000 5 1000.0000

JohnO · 2006-02-01 Reply Admin

Actually, Snapshotdate isn't necessary in the select either. Or you can show both. But coalescing snapshotdate isn't very meaningful.

JohnO · 2006-02-01 Reply Admin

Here's the final script in all it's tested glory:

/*
CREATE TABLE Margin (
SnapshotDate datetime not null,
FlowDate datetime not null,
TransactionID int not null,
Amount money not null
)

ALTER TABLE Margin ADD
CONSTRAINT PK_Margin PRIMARY KEY CLUSTERED (SnapshotDate, FlowDate, TransactionID)
GO
*/

INSERT Margin VALUES ('01/30/2006', '02/01/2006', 1, 50000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 2, 150000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 3, 200000)
INSERT Margin VALUES ('01/30/2006', '02/01/2006', 4, 600)

INSERT Margin VALUES ('01/31/2006', '02/01/2006', 1, 40000)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 3, 210000)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 4, 700)
INSERT Margin VALUES ('01/31/2006', '02/01/2006', 5, 1000)

DECLARE @BaseLine datetime,
@Current datetime

SELECT @BaseLine = '01/30/2006'
SELECT @Current = '01/31/2006'

SELECT BaseLine = BL.SnapshotDate,
[Current] = CT.SnapshotDate,
FlowDate = COALESCE(BL.FlowDate, CT.FlowDate),
TransactionID = COALESCE(BL.TransactionID, CT.TransactionID),
Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)
FROM ( SELECT * FROM Margin WHERE SnapshotDate= @BaseLine )BL
FULL OUTER JOIN ( SELECT * FROM Margin WHERE SnapshotDate= @Current ) CT
ON CT.FlowDate = BL.FlowDate
AND CT.TransactionID = BL.TransactionID

/* expected results
2006-01-30 00:00:00.000 2006-01-31 00:00:00.000 2006-02-01 00:00:00.000 1 -10000.0000
2006-01-30 00:00:00.000 NULL 2006-02-01 00:00:00.000 2 -150000.0000
2006-01-30 00:00:00.000 2006-01-31 00:00:00.000 2006-02-01 00:00:00.000 3 10000.0000
2006-01-30 00:00:00.000 2006-01-31 00:00:00.000 2006-02-01 00:00:00.000 4 100.0000
NULL 2006-01-31 00:00:00.000 2006-02-01 00:00:00.000 5 1000.0000
*/

WTF Batman · 2006-02-01 Reply Admin

Anonymous:
"By the way, Sybase really needs the SQLServer "join" syntax so where clauses can be less WTF-ery in and of themselves. "
More recent versions of Sybase (certainly from 12.5 onwards, possibly eralirer, can't remember) do indeed allow ANSI JOIN syntax.

Holy crap! You're absolutely right! We just upgraded to 12.5 from 11.x, but I hadn't realize that this syntax had been added. Thanks!

2006-02-01 Reply Admin

Jeff S:
Anonymous:
how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps

So what's the difference?

What does "a real join" give you that the where clause that joins the two tables doesn't?

Aside from being verbose to the point of being cobolesque, and not backwards compatible, that is.

"real joins" give you:

1) a much clearer, more logical SELECT with
a) a clear distinction between the join condition and the criteria
b) a clear, concise FROM clause to indicate your driving recordset (note I did not say "table") which is the most important part of any SELECT with joins

Um, huh? When using the JOIN syntax it seems to me that you get a long, heterogeneous FROM clause versus the short, homogeneous list of table tables when using the WHERE clause.

2) The ability to clearly express OUTER joins, and join order precendence (though this is rarely necessary, it can be handy)

The question is really explicit INNER JOINS vs implicit inner joins via the WHERE clause, so let's leave OUTER JOINS out of it.

3) The ability to add criteria to the join expression itself (technically only needed for outer joins, but it can make inner joins more clear as well):

SELECT ..
FROM A
LEFT OUTER JOIN B
ON A.ID = B.ID AND B.Status = "Active"

Again, this seems to take all the conditions, which were once in the same place in a WHERE clause and scatter them all about.

4) the ability to see explicitly when a CROSS JOIN is being done on purpose, as opposed to leaving yourself to manually read the entire WHERE clause to check to see if all tables are joined properly and wondering if the cartesian product is intential or not.

This sounds reasonable.

5) "Backwards-compatable"? uh, I guess not. you are correct. that is why I never use C#, because it is not backwards compatable with C, or CSS because it is not backwards compatable with netscape 2.0.

Agreed.

Finally,

6) It is ANSI standard to use join syntax. that means it will be FORWARD compatable. I will let you decide for yourself which is more important, to be forward-compatable or backwards compatable.

Are there databases that do not support doing implicit joins via the WHERE clause? (I'm serious, I do not know, I do not pretend to have that much experience...)

In summary it seems that all your answers go to style and maintenance, correct? Are there any performance benefits?

Thanks,
H.Q.

JohnO · 2006-02-01 Reply Admin

Anonymous:
Um, huh? When using the JOIN syntax it seems to me that you get a long, heterogeneous FROM clause versus the short, homogeneous list of table tables when using the WHERE clause.

The FROM/JOIN/ON syntax is much easier to read because it places the linking clause between two tables by the names of the tables in the query. Force yourself to use it for one day and I guarantee you will never switch back, unless maybe you are a really bad typist.

Anonymous:
The question is really explicit INNER JOINS vs implicit inner joins via the WHERE clause, so let's leave OUTER JOINS out of it.

Actually, outer joins are where the ansi-92 syntax is fundamentally better than the where clause syntax.

Anonymous:
In summary it seems that all your answers go to style and maintenance, correct? Are there any performance benefits?

Only with a badly implemented optimizer.

Jeff S · 2006-02-01 Reply Admin

Anonymous:
Jeff S:

"real joins" give you:

1) a much clearer, more logical SELECT with
a) a clear distinction between the join condition and the criteria
b) a clear, concise FROM clause to indicate your driving recordset (note I did not say "table") which is the most important part of any SELECT with joins

Um, huh? When using the JOIN syntax it seems to me that you get a long, heterogeneous FROM clause versus the short, homogeneous list of table tables when using the WHERE clause.

2) The ability to clearly express OUTER joins, and join order precendence (though this is rarely necessary, it can be handy)

The question is really explicit INNER JOINS vs implicit inner joins via the WHERE clause, so let's leave OUTER JOINS out of it.

3) The ability to add criteria to the join expression itself (technically only needed for outer joins, but it can make inner joins more clear as well):

SELECT ..
FROM A
LEFT OUTER JOIN B
ON A.ID = B.ID AND B.Status = "Active"

Again, this seems to take all the conditions, which were once in the same place in a WHERE clause and scatter them all about.

4) the ability to see explicitly when a CROSS JOIN is being done on purpose, as opposed to leaving yourself to manually read the entire WHERE clause to check to see if all tables are joined properly and wondering if the cartesian product is intential or not.

This sounds reasonable.

5) "Backwards-compatable"? uh, I guess not. you are correct. that is why I never use C#, because it is not backwards compatable with C, or CSS because it is not backwards compatable with netscape 2.0.

Agreed.

Finally,

6) It is ANSI standard to use join syntax. that means it will be FORWARD compatable. I will let you decide for yourself which is more important, to be forward-compatable or backwards compatable.

Are there databases that do not support doing implicit joins via the WHERE clause? (I'm serious, I do not know, I do not pretend to have that much experience...)

In summary it seems that all your answers go to style and maintenance, correct? Are there any performance benefits?

Thanks,
H.Q.

For #1 -- if you feel that

select 
 ...
from 
  a,b,c,d
where 
 a.id = b.id and
 b.id=c.id and
 b.col = c.col and
 b.startDate > '1/1/2000' and
 c.status='active' and
 d.id *= c.id and
 d.col *=  c.col and
 d.value *= 0

is more logical and it is easier to understand and to edit the "structure" of the SELECT rather than

select
  ...
from
  a
inner join b 
  on a.id = b.id
inner join c 
  on b.id = c.id and 
       c.col = b.col
left outer join d
  on d.id = c.id and 
      d.col = c.col and 
      c.value = 0
where
  b.startDate > '1/1/200' and
  c.status = 'active'

then there's not much I can do to convince you otherwise, I guess. yes, the first one is shorter, i will give you that. Even without the outer join, it is much easier and clearer to see how you have structured your SELECT and what the relations are between the tables.

(is that c.value *=0 syntax even legal? Can you write c.value *> 12 ? I don't even know if I did that right, it is so f'ed up to use that *= syntax I have no idea which need to go on which side .. how do you put in outer join criteria w/o using ANSI syntax?)

For #3 -- you are missing the effect of putting the extra (non-join) criteria in the outer join. that is NOT the same as putting that criteria in the WHERE clause -- it returns completely different results.

emurphy · 2006-02-01 Reply Admin

Jeff S:
if you feel that
select 
 ...
from 
  a,b,c,d
where 
 a.id = b.id and
 b.id=c.id and
 b.col = c.col and
 b.startDate > '1/1/2000' and
 c.status='active' and
 d.id *= c.id and
 d.col *=  c.col and
 d.value *= 0    
is more logical and it is easier to understand and to edit the "structure" of the SELECT rather than
select
  ...
from
  a
inner join b 
  on a.id = b.id
inner join c 
  on b.id = c.id and 
       c.col = b.col
left outer join d
  on d.id = c.id and 
      d.col = c.col and 
      c.value = 0
where
  b.startDate > '1/1/200' and
  c.status = 'active'
then there's not much I can do to convince you otherwise, I guess. yes, the first one is shorter, i will give you that. Even without the outer join, it is much easier and clearer to see how you have structured your SELECT and what the relations are between the tables.

Personally, I prefer the following:

select
  ...
from              a
       inner join b on a.id = b.id
       inner join c on b.id = c.id and c.col = b.col
  left outer join d on c.id = d.id and c.col = d.col and c.value = 0
where
  b.startDate > '1/1/200' and c.status = 'active'

emurphy · 2006-02-01 Reply Admin

emurphy:

Personally, I prefer the following:

select
  ...
from              a
       inner join b on a.id = b.id
       inner join c on b.id = c.id and c.col = b.col
  left outer join d on c.id = d.id and c.col = d.col and c.value = 0
where
  b.startDate > '1/1/200' and c.status = 'active'

or, slight improvement:

select
  ...
from              a
       inner join b on a.id = b.id
       inner join c on b.id = c.id and b.col = c.col
  left outer join d on c.id = d.id and c.col = d.col and c.value = 0
where
  b.startDate > '1/1/200' and c.status = 'active'

Note the consistent pattern of "left.something = (right.something or constant)" in the join conditions.

JohnO · 2006-02-01 Reply Admin

Jeff S:
JohnO -- it's OK to learn something now and then. Sometimes, when someone says "there are better ways", instead of instantly thinking they are stupid and trying to prove them wrong, you should sometimes consider listening to them.

You're right I'll never cease to be amazed at how stubborn some people are and that I should always test my SQL, even it's just to prove Jeff S wrong :).

If you still aren't convinced, let me try some inductive reasoning.

1) We both agree that inner join is useful

2) We both agree that left join is useful

4) We both agree that cross join is useful

What's missing between 2 and 4? For some reason you think full outer join is not useful and should never be used in a properly designed schema. I accept all of the joins and see that each has it's place.

I didn't make this case that I've presented up. I simplified it down to it's essential elements. I explained the real-world business scenario behind it. I provided sample data to demonstrate how it should work. You can view the query plan and see it's quite good. I submit that there's no clearly more straightforward or better performing way to achieve this.

rbriem · 2006-02-01 Reply Admin

Anonymous:
how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps

So what's the difference?

What does "a real join" give you that the where clause that joins the two tables doesn't?

Aside from being verbose to the point of being cobolesque, and not backwards compatible, that is.

As stated above, "real joins" have many benefits, not the least of which is readability.

But on the topic of "backwards compatible" - I recently worked on a honkin' big database conversion project that involved a mix of (old) Ingres on VMS, (newer) Ingres on Unix, and Oracle on Unix, and that imported and exported data to and from other huge and complex databases (both internal and external to the company).

Some of the database engines supported the SQL92 "join on" syntax, some didn't. All of our joins (some of which were monsters) were written using "where"s so that we could easily switch between systems.

Thus making the developers who worked mainly on the newer systems "backwards compatible" with the older syntax.

Jeff S · 2006-02-01 Reply Admin

JohnO:
Jeff S:
JohnO -- it's OK to learn something now and then. Sometimes, when someone says "there are better ways", instead of instantly thinking they are stupid and trying to prove them wrong, you should sometimes consider listening to them.

You're right I'll never cease to be amazed at how stubborn some people are and that I should always test my SQL, even it's just to prove Jeff S wrong :).

If you still aren't convinced, let me try some inductive reasoning.

1) We both agree that inner join is useful

2) We both agree that left join is useful

4) We both agree that cross join is useful

What's missing between 2 and 4? For some reason you think full outer join is not useful and should never be used in a properly designed schema. I accept all of the joins and see that each has it's place.

I didn't make this case that I've presented up. I simplified it down to it's essential elements. I explained the real-world business scenario behind it. I provided sample data to demonstrate how it should work. You can view the query plan and see it's quite good. I submit that there's no clearly more straightforward or better performing way to achieve this.

John -- did you miss all of my posts where I gave a simplier and more efficient solution?

I'll repost again for you:

DECLARE @BaseLine datetime, @Current datetime

SELECT @BaseLine = '01/30/2006'
SELECT @Current = '01/31/2006'

#1:

select FlowDate, TransactionID,
    sum(case when SnapShotDate=@BaseLine then Amount else 0 end) -
    sum(case when SnapShotDate=@Current then Amount else 0 end) as Difference
from
   Margin
where
   SnapShotDate between @BaseLine and @Current
group by FlowDate, transactionID
order by FlowDate, transactionID

#2:

SELECT SnapshotDate = COALESCE(BL.SnapshotDate, CT.SnapshotDate),
FlowDate = COALESCE(BL.FlowDate, CT.FlowDate),
TransactionID = COALESCE(BL.TransactionID, CT.TransactionID),
Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)
FROM ( SELECT * FROM Margin WHERE SnapshotDate= @BaseLine )BL
FULL OUTER JOIN ( SELECT * FROM Margin WHERE SnapshotDate= @Current ) CT
ON CT.FlowDate = BL.FlowDate
AND CT.TransactionID = BL.TransactionID
order by COALESCE(BL.FlowDate, CT.FlowDate), COALESCE(BL.TransactionID, CT.TransactionID)

(note that i added an ORDER BY clause to yours to ensure that we are returning the same results. Ordering results and further joining of results is another big problem with FULL OUTER JOINS).

I'll let you decide for yourself which is simplier, and I'll let you test for yourself which is more efficient.

Jeff S · 2006-02-01 Reply Admin

JohnO:
Jeff S:
JohnO -- it's OK to learn something now and then. Sometimes, when someone says "there are better ways", instead of instantly thinking they are stupid and trying to prove them wrong, you should sometimes consider listening to them.

You're right I'll never cease to be amazed at how stubborn some people are and that I should always test my SQL, even it's just to prove Jeff S wrong :).

If you still aren't convinced, let me try some inductive reasoning.

1) We both agree that inner join is useful

2) We both agree that left join is useful

4) We both agree that cross join is useful

What's missing between 2 and 4? For some reason you think full outer join is not useful and should never be used in a properly designed schema. I accept all of the joins and see that each has it's place.

I didn't make this case that I've presented up. I simplified it down to it's essential elements. I explained the real-world business scenario behind it. I provided sample data to demonstrate how it should work. You can view the query plan and see it's quite good. I submit that there's no clearly more straightforward or better performing way to achieve this.

What about #5? RIGHT OUTER JOINS = bad. very, very bad .... Do we agree on that?

And do not forget the "join" most under-utilized and misused above all: the UNION [ALL] "join". A very powerful tool for comparing two tables and doing things that you might think only a FOJ can do.

2006-02-01 Reply Admin

triso:
Anonymous:
...I have to wonder whose these people are that manage to get hired without a clue. Seriously - is it just because consulting companies are outsourcing themselves to the lowest bidder?
Perhaps but I always thought it was because the consultants dress well, can communicate well and have good people-skills. None of which are technical geek traits.

Those aren't consultants, they're con-men!

JohnO · 2006-02-01 Reply Admin

You should have SnapshotDate IN (@BaseLine, @Current). They don't have to be consecutive dates and between is wrong. I want a variance between two days.

I don't agree that your:

Amount = sum(case when SnapShotDate=@BaseLine then Amount else 0 end) -
sum(case when SnapShotDate=@Current then Amount else 0 end)

is cleaner than:

Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)

The performance of your query is not any better than mine.

Finally, the real problem with your solution is that it relies on the fact that what I posted is essentially a self-join. In the real world example, the current margin is in a different table than the snapshots. So if the user wants to see a variance between the current state of the data and a snapshot, the second subquery is on a different table.

I find it highly amusing that you so easily declare a join type that a lot of smart people decided was useful when putting together the ANSI-92 spec useless.

wk633 · 2006-02-01 Reply Admin

JohnO:
Anonymous:
Anonymous:

All was fine when system was very small, but as data got added the report that had this query ran slower, and slower, and slower. This is how the SQL was fixed.

select col1, col2, col3 from a, b, c, d where a.c4 = b.c4 and b.c4 = c.c4 and c.c4 = d.c4;

how about actually doing the inner joins?

Select col1, col2, col 3
from a inner join b on a.c4 = b.c4
inner join c on a.c4 = c.c4
inner join d on a.c4 = d.c4
?
I don't know maybe it's just me but actually have a real join as opposed to the where clauses helps.

I agree 100%. Anyone still not using the join syntax should be fired right now.

ok, call me the Devil's Advocate... but is it more than efficiency? I've been told that where clauses can return incorrect results. That I just don't get. Inneficient, sure. Wrong? Is there a bug in SQL Server that has never been reported anywhere?

JohnO · 2006-02-01 Reply Admin

Jeff S:
And do not forget the "join" most under-utilized and misused above all: the UNION [ALL] "join". A very powerful tool for comparing two tables and doing things that you might think only a FOJ can do.

Taking that argument to it's logical conclusion, we don't need anything but the six fundamental relational operations, which does not include any join but CROSS JOIN. Are you advocating that?

Jeff S · 2006-02-01 Reply Admin

JohnO:
You should have SnapshotDate IN (@BaseLine, @Current). They don't have to be consecutive dates and between is wrong. I want a variance between two days.

I don't agree that your:

Amount = sum(case when SnapShotDate=@BaseLine then Amount else 0 end) -
sum(case when SnapShotDate=@Current then Amount else 0 end)

is cleaner than:

Amount = COALESCE(CT.Amount, 0) - COALESCE(BL.Amount, 0)

The performance of your query is not any better than mine.

Finally, the real problem with your solution is that it relies on the fact that what I posted is essentially a self-join. In the real world example, the current margin is in a different table than the snapshots. So if the user wants to see a variance between the current state of the data and a snapshot, the second subquery is on a different table.

I find it highly amusing that you so easily declare a join type that a lot of smart people decided was useful when putting together the ANSI-92 spec useless.

#1 -- First off, the performance is twice as good, and it gets even better when you need to join to other tables as mentioned (the FOJ kills your indexes).

#2 -- RIGHT OUTER JOINS are even *worse* than FOJ's, and they are valid as well. I hope we can agree on that. FOJ's might have their use when you need to hack together something quick, but for a production database they sure don't make much sense as I've shown.

#3 -- comparing two tables is easily done with a simple UNION ALL and then a GROUP BY, almost the same as I have done in my example. see http://weblogs.sqlteam.com/jeffs/archive/2004/11/10/2737.aspx for some examples.

#4 -- yes, you can use OR if you like. Either that or IN() if you prefer. It's funny, with an OR and about 5 rows, the performance comes out the same. Add a couple of thousand of rows in there and then recheck it. then try joining the results to another table (say, a Transactions table on TransactionID) and let me know how that goes.

#5 -- And finally, I guess I give up. I took an awful lot of time to show you some techniques you obviously were not aware of, and if you think that in these situations we talked about the FOJ solution is cleaner and more efficient than simple GROUP BY's, then all I can say is best of luck to you.

Jeff S · 2006-02-01 Reply Admin

JohnO:
Jeff S:
And do not forget the "join" most under-utilized and misused above all: the UNION [ALL] "join". A very powerful tool for comparing two tables and doing things that you might think only a FOJ can do.

Taking that argument to it's logical conclusion, we don't need anything but the six fundamental relational operations, which does not include any join but CROSS JOIN. Are you advocating that?

umm ... huh? can you expand on that? you lost me there.

JohnO · 2006-02-01 Reply Admin

Jeff S:
What about #5? RIGHT OUTER JOINS = bad. very, very bad .... Do we agree on that?

"A RIGHT JOIN B" = "B LEFT JOIN A"

I don't like right join because it's really just a syntactical variation of left join. I can say that I truly never have used right join because when I am writing a FROM clause, I always like to start with the table that has all the values (assuming there is one :)).

Jeff S · 2006-02-01 Reply Admin

JohnO:
Jeff S:
What about #5? RIGHT OUTER JOINS = bad. very, very bad .... Do we agree on that?

"A RIGHT JOIN B" = "B LEFT JOIN A"

I don't like right join because it's really just a syntactical variation of left join. I can say that I truly never have used right join because when I am writing a FROM clause, I always like to start with the table that has all the values (assuming there is one :)).

Yes! very true, and I am happy to hear that. You just moved up a few notches in my book.

JohnO · 2006-02-01 Reply Admin

Jeff S:

#5 -- And finally, I guess I give up. I took an awful lot of time to show you some techniques you obviously were not aware of, ...

Your arrogance is astounding. I've used SUM() on a CASE many times to achieve a single pass through the table.

Jeff S:
JohnO:

Jeff S:
And do not forget the "join" most under-utilized and misused above all: the UNION [ALL] "join". A very powerful tool for comparing two tables and doing things that you might think only a FOJ can do.

Taking that argument to it's logical conclusion, we don't need anything but the six fundamental relational operations, which does not include any join but CROSS JOIN. Are you advocating that?

umm ... huh? can you expand on that? you lost me there.

http://en.wikipedia.org/wiki/Relational_algebra

JohnO · 2006-02-01 Reply Admin

Jeff S:
#2 -- RIGHT OUTER JOINS are even *worse* than FOJ's, and they are valid as well.

To me you lost all credibility with this statement. You seem to lack some basic understanding of how a good query optimizer woks. If two statements are functionally equivalent (which you can prove with relational algebra), then an ideal optimizer will come up with the same plan regardless of which statement you use. Oviously, you can't achive the "ideal" or perfect optimizer. However, I would hope that you would agree that any "good" optimizer can easily transform between left and right join.

Reporting All Through The Night

Leave a comment on “Reporting All Through The Night”