Relational Databases and No-SQL Databases take two key different philosophies, by and large. No-SQL is hard to talk about in broad terms, as it's mostly a set of unrelated technologies all solving the data storage problem in different ways. But we can still make some broad generalizations.
In No-SQL-land, we mostly store data the way we plan to query it. Ad-hoc queries are likely a bad choice, if they're even allowed. In RDBMSes, we store data according to a platonic ideal of normal forms, driven by the data in our domain. We can query the data however we like, using a language that only declares the data we want, not how to get it. Indexes and views and other behind-the-scenes structures make the query efficient.
Neither approach is wrong or bad or better, they are tuned to different ways of thinking about the problem. But the "invisible performance" of RDBMSes means that developers can invent new footguns without even knowing it.
Dotan had a database query that went from "performing fine" to "performing poorly" in the space of a few hours. This was the query:
select bars.*, DATE_FORMAT(bars.date_entered,'%Y-%m-%d %h:%i') as entered,
sales_reps.name as rep_name,
case monthly_sales when 'Not in business yet' then 'no business'
else monthly_sales end as new_monthly_sales,
left(bars.name,20) as new_name, left(bars.company,15) as new_company,
left(bars.email,23) as new_email,
(select max(action_date) from bar_action
where bar_action.bar_id=bars.id) as modify_date,
(select count(distinct foo_id) from bar_foos
where bar_foos.bar_id=bars.id) as total_foos,
bar_action.id AS credit_report_uploaded,
credit_report_requests.id AS credit_report_requested,
applications.id as app_id
from bars LEFT JOIN sales_reps ON bars.rep_id=sales_reps.id
LEFT JOIN applications ON bars.id=applications.bar_id
LEFT JOIN bar_action ON
(bars.id=bar_action.bar_id AND action_desc='credit_report_uploaded')
LEFT JOIN credit_report_requests ON bars.id=credit_report_requests.bar_id
WHERE bars.is_deleted =0 and
bars.status IN ('O/LU','O/LAU','O/LA','O/LAA','IP/L','IP/LA','H/','IP/N')
ORDER BY CASE assign_date WHEN NULL THEN bars.date_entered
WHEN '0000-00-00 00:00:00' THEN bars.date_entered
ELSE assign_date END DESC LIMIT 0,5
I've broken it up across lines, because as found, it was all on one line.
Now, the query itself maybe doesn't have any explicit WTFs in it. It's hard, with SQL, to see what's wrong from the query alone. But there are definitely smells. It's weird that so many fields get left
ed- pulling out the leftmost substring. It's weird that monthly_sales
is a text data type (since it can hold Not in business yet
), and that we have to convert it to no business
for formatting. It's weird that there's a bars.*
at the top, then a few more fields from bars
that get included. It's weird that we're doing some summaries using correlated subqueries (the max
and count
queries).
None of these things are inherently wrong though.
Dotan supplied a little more background, though. Each table listed there has hundreds of thousands of rows in it. Note that LIMIT 0,5
at the end: starting from the zeroth row, take only 5 rows. Again, querying across all these large tables to get only five results isn't wrong, but it's suspicious.
Dotan adds: "Not a single index on the poor database."
And that's your real WTF. The query itself isn't a disaster- it's not great, but by the standards of "terrible SQL queries", I've seen far worse. But the database doesn't have any indexes, at all. The miracle is that this wasn't already running poorly years earlier.
But the query is still bad, though it's hard to say exactly how bad without knowing what it actually needs to return for the application. Dotan can give us a ballpark though:
This could have been done in a single simple query that fits on two lines of my text editor.
So in addition to being a terrible database that isn't leveraging the main feature SQL databases need for performance, the query is definitely over-engineered.
We can assume Dotan replaced this query with his shorter version, after a round of testing to make sure he wasn't introducing any regressions. As to whether or not any indexes ever got added, we can only hope.