SQLTeam.com | Weblogs | Forums

Two column join performance

Two tables each with 6.4 million records in SQL Server 2012. Let's say A and B. I have to join two tables based on two columns. Lets say A.C1 = B.C1 and also A.C2 <> B.C2. Both C1 and C2 are nvarchar columns. Table B does not have cluster index. I have the query as show below:

Select B.C1...A.C12, A.C2...A.C5 from A inner join B on A.C1 = B.C1 Where A.C2 <> B.C2

This is taking a more than 16 min to 30 min. It does a full table scan on B as other columns from B. When it takes more than 30 mins, the query timeouts. Can you please let me know what is the best way to write this query.

Since you don't have a clustered index on table B, this would be a good opportunity to do so. C1 as the 1st key and C2 as the 2nd. You'll also want a similar (C1, C2), nonclustered index on A. Depending on columns in A, you may or may not want it to be a covering index... A covering index will prevent the need to do a key lookup but will result in a much larger index.

In terms of readability, I'd format the code as so...

    A a
    JOIN B b
        ON a.C1 = b.C1
        AND a.C2 <> b.C2;

Clustered index cannot be created based on the design of the table. Is there any other way to improve the performance?

What about the design makes a clustered index impossible?
Are you sure you aren't confusing a clustered index with a primary key?

Will create the clustered index and see the execution plan.

There are two non clustered indexes on each of the columns on table B. There are no clustered indexes. Will there be any difference in performance between the two? Should I include both the columns on the clustered index?

You'll have to look at the execution plan. Depending on the data distribution, there's a good chance that you'll end up with table scans no matter what you do.
The reason I suggest a clustered index is the fact that they are free. You table data is stored either as a heap or a clustered index. Both take up the same amount of space, but 1 is more useful.

Thank you Jason.

Jason, what kind of data distribution would end up with the table scan? Also when you say "1 is more useful", does it mean heap or cluster?

If, for example, 95% of C2 values are matched across both tables based on C!, then there's a possibility that the optimizer may see that in the statics and choose a seek operation. Then again it may not. The only way to find out is to test.

A clustered index is useful because it is order on it's key values. A heap is less useful.

You almost certainly shouldn't be doing a join but using a NOT EXISTS. The non-equi join will create a semi-cartesian product that could really bloat the number of rows given the sizes of the table(*).

For a NOT EXISTS, for this specific query, a nonclustered index on B on ( C1, C2 ) will work just as well as clustering the table. However, if the table does not a clustered index, and you (most) always search the table by C1, it's definitely vastly better for performance to cluster the table that way. In fact, the single most important factor to overall database performance is how the tables are clustered.

SELECT A.C1, A.C12, A.C2, A.C5

(*) For example, say table A contains rows (1, 3) and (1, 4), and table B contains rows (1, 2), (1, 5), (1, 7), (1, 9) and (1, 10), then a total of 10 rows will be listed from a join, but only two rows from a NOT EXISTS.

Scott, It has same execution plan with the solution you mentioned.

Will create a clustered index on table B and let you know.