Massive SQL JOIN

Hey Guys,

AND (VW.fieldA IS NULL OR P.fieldA IS NULL OR VW.fieldA = P.fieldA)
AND (VW.fieldB IS NULL OR P.fieldB IS NULL OR VW.fieldB = P.fieldB)

I have a table join using this clause on 53 different fields. It takes over an hour to run.

(The strategy is that a record is only eliminated when both fields have a value and the values are not equal. NOTE: ISNULL will not work because it's a match if either field is NULL and the other has any value.)
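
Roughly, the whole thing is shaped like this (simplified, with placeholder table and key names; only two of the 53 fields shown):

SELECT VW.PKeyCol1, P.PKeyCol1
FROM MyView AS VW
    JOIN ImportTable AS P
        ON (VW.fieldA IS NULL OR P.fieldA IS NULL OR VW.fieldA = P.fieldA)
       AND (VW.fieldB IS NULL OR P.fieldB IS NULL OR VW.fieldB = P.fieldB)
       -- ... and so on for all 53 fields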

Is there smarter logic?

Thanks

When I encounter

WHERE (A = 1 OR B = 2)
      AND C = 3

I have often found that a UNION (to avoid the OR) is faster - particularly if I can use a UNION ALL.

With your 53 different fields I'm not sure that is going to be of any use though ...
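
As a small illustration of that rewrite (table and column names invented), the OR version versus the UNION version:

-- Original: the OR tends to force a scan
SELECT ID
FROM SomeTable
WHERE (A = 1 OR B = 2)
      AND C = 3

-- Rewritten: each branch can use its own index.
-- Plain UNION de-duplicates the rows where both A = 1 and B = 2;
-- to use UNION ALL instead, the second branch has to exclude that overlap itself.
SELECT ID
FROM SomeTable
WHERE A = 1 AND C = 3
UNION ALL
SELECT ID
FROM SomeTable
WHERE B = 2 AND C = 3
      AND (A <> 1 OR A IS NULL)   -- don't double-count rows already returned above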

How often is

(VW.fieldA IS NULL OR P.fieldA IS NULL OR VW.fieldA = P.fieldA)

TRUE - is that the majority, or minority, or "very rare"?

Is one of the fields being NULL rarer than the fields being EQUAL?

I'm just wondering whether it is worth doing something like

INSERT INTO #TEMP
SELECT PKeyCol1, PKeyCol2,
       [FlagFieldA] = CASE WHEN VW.fieldA IS NULL THEN 1 ELSE 0 END,
       [FlagFieldB] = CASE WHEN VW.fieldB IS NULL THEN 1 ELSE 0 END,
       ...
FROM MyTable AS VW

so that on the first pass we gather all the NULL states. It would be helpful to know if ALL the FlagFields are 1 - by having some sort of [AllFieldsSet] column. Once all the FlagFields are 1, the row needs no further processing - it is a match.
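
Something like this, perhaps (just a sketch; fieldA/fieldB stand in for the full list):

INSERT INTO #TEMP
SELECT PKeyCol1, PKeyCol2,
       [FlagFieldA]   = CASE WHEN VW.fieldA IS NULL THEN 1 ELSE 0 END,
       [FlagFieldB]   = CASE WHEN VW.fieldB IS NULL THEN 1 ELSE 0 END,
       ...
       [AllFieldsSet] = CASE WHEN VW.fieldA IS NULL
                              AND VW.fieldB IS NULL
                              -- ... AND the remaining fields
                             THEN 1 ELSE 0 END
FROM MyTable AS VW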

Then, on the second pass, only process rows where [AllFieldsSet] = 0 and set columns to 1 if P.FieldX IS NULL AND #TEMP.FlagFieldX = 0.
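
i.e. something along these lines (a sketch only - I'm assuming here that the #TEMP rows can be paired to P rows by key; adjust the join to however the two sides actually relate):

UPDATE T
SET FlagFieldA = CASE WHEN P.fieldA IS NULL THEN 1 ELSE T.FlagFieldA END,
    FlagFieldB = CASE WHEN P.fieldB IS NULL THEN 1 ELSE T.FlagFieldB END
    -- ... and so on for the other fields
    -- (recompute [AllFieldsSet] afterwards if you keep using it)
FROM #TEMP AS T
    JOIN MyTable2 AS P
        ON P.PKeyCol1 = T.PKeyCol1     -- assumed pairing
       AND P.PKeyCol2 = T.PKeyCol2
WHERE T.AllFieldsSet = 0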

It might just be possible that this beats SQL's query planner ... but ... in the main, of course, this sort of programming-around doesn't turn out to be quicker than the Query Planner.

Second thought:

Take the rarest event that is likely to be true. Let's say that is FieldC:

INSERT INTO #TEMP
SELECT *
FROM
(
    (
        SELECT PKeyCol1, PKeyCol2
        FROM MyTable AS VW
        WHERE VW.fieldC IS NULL
        UNION
        SELECT PKeyCol1, PKeyCol2
        FROM MyTable2 AS P
        WHERE P.fieldC IS NULL
    ) UNION ALL
    SELECT PKeyCol1, PKeyCol2
    FROM MyTable AS VW
        JOIN MyTable2 AS P
            ON P.fieldC = VW.fieldC
) AS X

and then JOIN #TEMP to the other FieldY tests to reduce the number of rows that are compared.
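
For example (again just a sketch, and it assumes #TEMP ends up holding the VW keys that passed the FieldC test):

SELECT VW.PKeyCol1, VW.PKeyCol2
FROM #TEMP AS T
    JOIN MyTable AS VW
        ON VW.PKeyCol1 = T.PKeyCol1
       AND VW.PKeyCol2 = T.PKeyCol2     -- only rows that survived the FieldC test
    JOIN MyTable2 AS P
        ON (VW.fieldA IS NULL OR P.fieldA IS NULL OR VW.fieldA = P.fieldA)
       AND (VW.fieldB IS NULL OR P.fieldB IS NULL OR VW.fieldB = P.fieldB)
       -- ... the remaining field tests; FieldC can be left out, it has already passed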


Can you add a bit or tinyint column as a flag to the table rows to indicate matched / unmatched / unknown|needs-checked? Or do all the rows change every day??

The idea being to avoid having to re-check unchanged rows the next time. You'd only have to check new/modified rows.
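
Roughly like this, perhaps (a sketch; the column name and state values are made up):

-- 0 = needs checking, 1 = matched, 2 = unmatched (values arbitrary)
ALTER TABLE MyTable2 ADD MatchState tinyint NOT NULL DEFAULT 0

-- Set MatchState back to 0 whenever a row is inserted/updated
-- (from the application, or a trigger). The comparison job then only
-- has to look at:
SELECT P.PKeyCol1, P.PKeyCol2
FROM MyTable2 AS P
WHERE P.MatchState = 0     -- only new/modified rows get the expensive 53-field test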


They change all the time...

This is a mixed bag. I'm comparing catalogs (XML Imports) to the existing schema. Comparing these fields is equivalent to matching certain subsets.

Interestingly, though, table A is a view. I could do a couple of things:
1) Join the tables directly (I just don't see the advantage - that's where the field count comes from, 53 tables)
2) Add a flag field to the view. Can you explain how I might gain by comparing the flag field rather than null?

I also wondered about progressive subsets. I could extract, say, 10 field comparisons, putting the matching rows into a subset (temp table). I could then compare, say, another 10 fields of that subset against the view into a new subset (temp table) ... and so on to the end. Would I gain performance with this?
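
Something like this is what I have in mind (borrowing the placeholder names from above, and carrying the keys from both sides through each stage):

-- Stage 1: pairs that survive the first 10 field tests
SELECT VW.PKeyCol1 AS VW_Key, P.PKeyCol1 AS P_Key
INTO #Stage1
FROM MyTable AS VW
    JOIN MyTable2 AS P
        ON (VW.field01 IS NULL OR P.field01 IS NULL OR VW.field01 = P.field01)
       -- ... fields 02 to 10

-- Stage 2: the next 10 tests, but only on the surviving pairs
SELECT S.VW_Key, S.P_Key
INTO #Stage2
FROM #Stage1 AS S
    JOIN MyTable AS VW ON VW.PKeyCol1 = S.VW_Key
    JOIN MyTable2 AS P ON P.PKeyCol1 = S.P_Key
WHERE (VW.field11 IS NULL OR P.field11 IS NULL OR VW.field11 = P.field11)
  -- ... fields 12 to 20, and so on for further stages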

Thanks for your reply, btw.

I don't think you will, on its own.

What I was hoping might work is to establish where FieldA matches (either on one of the pair of columns being NULL or the fields being EQUAL). For a single column that is indexed this should be quick.

Provided you could do this for the most, or "one of the most", selective columns first, you would then have a subset of the total data which is, say, only 10% of the rows.

If you then only test those rows for the remainder of the column tests, you will only be checking 10% of the total data with the "big JOIN query".

I don't think doing 10 first will help - it will be a scan of the whole table. What we want is an INDEX that covers the query, and is very quick, to get the first "cut" of the data. But ... you would have to be able to predict which column is likely to be highly selective.
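
Along these lines (a sketch, reusing the placeholder names from earlier; it only pays off if the chosen column is rarely NULL and rarely EQUAL on both sides):

-- First cut: just the one, hopefully very selective, test -
-- ideally supported by an index on fieldC on both sides
SELECT VW.PKeyCol1 AS VW_Key, P.PKeyCol1 AS P_Key
INTO #FirstCut
FROM MyTable AS VW
    JOIN MyTable2 AS P
        ON (VW.fieldC IS NULL OR P.fieldC IS NULL OR VW.fieldC = P.fieldC)

-- Then the remaining 52 tests, but only against the surviving pairs
SELECT C.VW_Key, C.P_Key
FROM #FirstCut AS C
    JOIN MyTable AS VW ON VW.PKeyCol1 = C.VW_Key
    JOIN MyTable2 AS P ON P.PKeyCol1 = C.P_Key
WHERE (VW.fieldA IS NULL OR P.fieldA IS NULL OR VW.fieldA = P.fieldA)
  AND (VW.fieldB IS NULL OR P.fieldB IS NULL OR VW.fieldB = P.fieldB)
  -- ... and the rest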