Two column join performance

cncpvv · September 4, 2017, 12:09am

Two tables each with 6.4 million records in SQL Server 2012. Let's say A and B. I have to join two tables based on two columns. Lets say A.C1 = B.C1 and also A.C2 <> B.C2. Both C1 and C2 are nvarchar columns. Table B does not have cluster index. I have the query as show below:

Select B.C1...A.C12, A.C2...A.C5 from A inner join B on A.C1 = B.C1 Where A.C2 <> B.C2

This is taking a more than 16 min to 30 min. It does a full table scan on B as other columns from B. When it takes more than 30 mins, the query timeouts. Can you please let me know what is the best way to write this query.

Jason_A_Long · September 4, 2017, 1:51am

Since you don't have a clustered index on table B, this would be a good opportunity to do so. C1 as the 1st key and C2 as the 2nd. You'll also want a similar (C1, C2), nonclustered index on A. Depending on columns in A, you may or may not want it to be a covering index... A covering index will prevent the need to do a key lookup but will result in a much larger index.

In terms of readability, I'd format the code as so...

SELECT
    b.C1,
    ...
    b.C12,
    a.C2,
    ...
    a.C5
FROM
    A a
    JOIN B b
        ON a.C1 = b.C1
        AND a.C2 <> b.C2;

cncpvv · September 4, 2017, 2:11am

Clustered index cannot be created based on the design of the table. Is there any other way to improve the performance?

Jason_A_Long · September 4, 2017, 2:30am

What about the design makes a clustered index impossible?
Are you sure you aren't confusing a clustered index with a primary key?

cncpvv · September 4, 2017, 4:21am

Will create the clustered index and see the execution plan.

cncpvv · September 4, 2017, 4:32am

There are two non clustered indexes on each of the columns on table B. There are no clustered indexes. Will there be any difference in performance between the two? Should I include both the columns on the clustered index?

Jason_A_Long · September 4, 2017, 5:57pm

You'll have to look at the execution plan. Depending on the data distribution, there's a good chance that you'll end up with table scans no matter what you do.
The reason I suggest a clustered index is the fact that they are free. You table data is stored either as a heap or a clustered index. Both take up the same amount of space, but 1 is more useful.

cncpvv · September 5, 2017, 1:53am

Thank you Jason.

cncpvv · September 5, 2017, 2:39pm

Jason, what kind of data distribution would end up with the table scan? Also when you say "1 is more useful", does it mean heap or cluster?

Jason_A_Long · September 5, 2017, 3:59pm

If, for example, 95% of C2 values are matched across both tables based on C!, then there's a possibility that the optimizer may see that in the statics and choose a seek operation. Then again it may not. The only way to find out is to test.

A clustered index is useful because it is order on it's key values. A heap is less useful.

ScottPletcher · September 5, 2017, 5:03pm

You almost certainly shouldn't be doing a join but using a NOT EXISTS. The non-equi join will create a semi-cartesian product that could really bloat the number of rows given the sizes of the table(*).

For a NOT EXISTS, for this specific query, a nonclustered index on B on ( C1, C2 ) will work just as well as clustering the table. However, if the table does not a clustered index, and you (most) always search the table by C1, it's definitely vastly better for performance to cluster the table that way. In fact, the single most important factor to overall database performance is how the tables are clustered.

SELECT A.C1, A.C12, A.C2, A.C5
FROM A
WHERE NOT EXISTS(SELECT 1 FROM B WHERE A.C1 = B.C1 AND A.C2 = B.C2)

(*) For example, say table A contains rows (1, 3) and (1, 4), and table B contains rows (1, 2), (1, 5), (1, 7), (1, 9) and (1, 10), then a total of 10 rows will be listed from a join, but only two rows from a NOT EXISTS.

cncpvv · September 5, 2017, 5:47pm

Scott, It has same execution plan with the solution you mentioned.

cncpvv · September 5, 2017, 5:50pm

Will create a clustered index on table B and let you know.