Types of joins

Thunder_86 · January 5, 2016, 7:50pm

Hi everybody,

So I've been a programmer for awhile now. I never worked for a huge development firm, but I've always been the best SQL developer in every job I had lol. Doesn't mean I know everything.

While working on one of our modules, I stumbled on a query that does a "INNER MERGE JOIN". I've never seen that before. At first I thought it was a little bit like the "LEFT OUTER JOIN" vs "LEFT JOIN", which are the same thing. The "LEFT JOIN" is an OUTER JOIN, the "OUTER" part being simply implied.

I tried removing the MERGE from the INNER MERGE JOIN, and it gave me the same results, but sorted differently(there is no ORDER BY on the main query), which tells me there is definitely something that's being done differently.

So I read a little bit and found the difference between INNER NESTED LOOP JOIN, INNER MERGE JOIN and INNER HASH JOIN. I'm starting to get part of it, even tho it's still a bit nebulous to me. What would a standard INNER JOIN do? When should I do what? If I understand right, the NESTED LOOP should be used if one of the two tables or queries joined contains VERY few rows (< 10). The Merge join should be used on mid-size tables/queries that have similar number of records and the HASH JOIN should be used for two big tables having a very different number of records.

Is the "INNER JOIN" simply a short for one of those 3 types of joins just like the LEFT JOIN being a short for a LEFT OUTER JOIN? if so, which one is it? If not, shouldn't we avoid normal "INNER JOIN"s and always try to use the most appropriate type of join depending on the situations?

Thanks a lot, I'm always trying to improve, but sadly, we do not use SQL that much here, even tho my boss tries to give me most of the SQL stuff we have to do, everybody else can't wrap their head around for some reasons, and usually end up asking me for help, but I'm still getting a bit rusty and having some trouble getting the difference between those!

Thanks!

gbritton · January 5, 2016, 8:38pm

INNER JOIN tells SQL to use that type of a join. There are the three main types you found. Normally. we let the optimizer choose the join type. This is preferred except in special circumstances

Thunder_86 · January 5, 2016, 8:44pm

Makes sense! I did some testing, and yeah, the optimizer seems to take the best type automatically.

I did find, however, that sometimes, the optimizer has it's limits. Sometimes changing the order of the tables can help getting more performance, or messing with the whole query and sub-queries I can sometimes cut back on the execution time a lot. What would be the best way to optimize a query? I've always been pretty good to find alternate ways to do queries to try to get good performances, but I'm pretty sure that for bigger queries, I could still squeeze out a few milliseconds or sometimes literally shave off a few seconds!

Thanks!

gbritton · January 6, 2016, 4:34pm

Try to order the joins from the smallest expected row set to the largest.
Try not to join more than 6 or 7 tables in one query. Sometimes the optimizer gives up and goes with the simplest (nested loops) even though that is often the worst performer. e.g. If you have 15 tables, join 5, write to a temp table; join the temp table with the next 5 and write to another temp table, then join that one with the last 5. It seems counter intuitive but frequent SQL's execution plan for complex joins includes spills to tempdb anyway. This way, you can control it better.
Don't join on views. join on the underlying tables.
Make sure that your join and where predicates are indexed or included columns

note that what you usually want to optimize is I/O. At least that's where I start. If I see high CPU then I take a second look.

ScottPletcher · January 6, 2016, 7:02pm

What would be the best way to optimize a query?

Big topic! Ultimately it all involves looking at the query plans for the query.

Check the query for obvious errors: functions on columns in WHERE or JOIN clauses. That includes never using ISNULL() in a WHERE or JOIN (it can always be coded around easily enough, with vastly better performance).

In-line table-valued functions are great for performance. But avoid scalar functions if possible. And avoid multi-line table functions like the plague: convert them to inline instead.

More broadly, you must also make sure that you have the best clustered index on every table in the query. Otherwise you end up having to tune every query and add nonclustered indexes you don't need. Once the best CI is in place, then you can look at individual queries and their query plans and determine if you need to add a nonclustered index(es) or if you need to rewrite the query.