Help related to Query Tuning using UNION

rahul · August 23, 2015, 7:08pm

hi all,

I have a master table from which I need to copy records into processing table based on the item_code and Item_type.for each item_code there can be multiple item_type.

Master Table(MT)

Key column
mmyy_KEY
Item_code
Item_type

Processing table (PT)

key column
mmyy_kEY
Item_code
Item_type

The processing table is pre-populated with some data but my requirement is to copy the missing item_code and item_type from the master table. If the PT already has the Item_code then the missing item_type needs to be copied for the item_code.

I am writing the below query :

indent preformatted text by 4 spaces
insert into processing table
Key column
Item_code
Item_type
....other misc columns
SELECT
Key column
Item_code
Item_type
....other misc columns
FROM MASTER TABLE
WHERE mmyy_KEY = 1408
AND ITEM_CODE NOT IN
(
SELECT DISTINCT(ITEM_CODE) FROM PT
where mmyy_KEY = 1408
)

union

insert into processing table
Key column
Item_code
Item_type
....other misc columns
SELECT
Key column
Item_code
Item_type
....other misc columns
FROM MASTER TABLE
WHERE mmyy_KEY = 1408
and item_code in
(
SELECT DISTINCT(ITEM_CODE) FROM PT
where mmyy_KEY = 1408
)

indent preformatted text by 4 spaces

T he approx number of records for each item_code and item_type is around 20000K . so if there are 5 unique item_code then around 100000 records would be copied based on the above query. This process would be run on adhoc basis (around 3-4 times in a month)

Please suggest if the query is fine or can be tuned.

Thanks in Advance!!

bitsmed · August 23, 2015, 7:33pm

I would think, this will do the trick:

insert into [processing table]([Key column]
                            ,Item_code
                            ,Item_type
                            ,....other misc columns
                            )
 select [Key column]
       ,Item_code
       ,Item_type
       ,....other misc columns 
   from [MASTER TABLE] as mt
  where mmyy_KEY=1408
    and not exists(select 1
                     from [processing table] as pt
                    where pt.mmyy_KEY=mt.mmyy_KEY
                      and pt.ITEM_CODE=mt.Item_code
                  )
;

Kristen · August 23, 2015, 10:37pm

I use an OUTER JOIN, but I think the Query Plan it makes is identical to Bitsmeds "Not Exists" solution. I just find that I can change the OUTER JOIN to INNER (e..g for an equivalent UPDATE on rows that are different, rather than "missing", and thus my code is more reusable, and less chance of bugs as a consequence. Its just Personal Choice though)

Our House Style in code like this is to use Table Alias of "S" for Source and "D" for Destination just to try to ensure that when we reference columns in JOIN or WHERE clause we more naturally refer to the right table (alias). Again, its personal preference though.

insert into [processing table]
(
    [Key column]
    ,Item_code
    ,Item_type
    ,....other misc columns
)
 select [Key column]
       ,Item_code
       ,Item_type
       ,....other misc columns 
FROM [MASTER TABLE] as S
    LEFT OUTER JOIN [processing table] as D
         ON D.mmyy_KEY= S.mmyy_KEY
        AND D.ITEM_CODE = S.Item_code
WHERE     D.mmyy_KEY IS NULL	-- Only rows that do not already exist
      AND S.mmyy_KEY=1408

For your "already exists" second UNION you can repeat the code but with an INNER JOIN (and you don't then need the "D.mmyy_KEY IS NULL" in the WHERE clause.

I presume that ALL rows should be processed - i.e. you do NOT want SQL to check for, and remove, Duplicates? in which case you should use "UNION ALL" - which will be faster (it avoids the Sort and De-Dupe steps)

Provided you have suitable indexes on the JOIN columns this should be fast. Using UNION ALL I would not expect the combined pair of statements to be any different in speed to two separate statements, but they will be ATOMic.

However, if you are inserting a huge number of rows and IF you do NOT need the two sections to be ATOMic then you can lighten the load on the Transaction Log by making them as two separate statements. It would be important NOT to use a TRANSACTION for this to have any possible benefit; if you are using SIMPLE Recovery Model then a CHECKPOINT after the first statement will enable the TLog to be reused for the second statement. If you are using FULL Recovery Model a TLog backup can be backing up, and then freeing up, the log space of the first INSERT whilst the second one is in-progress.

But unless the two operations are genuinely separate it would be much better to have them within a single transaction - probably most easily done by using UNION ALL

Kristen · August 23, 2015, 10:38pm

PLEASE NOTE: I made a critical correction to the code above.

rahul · August 24, 2015, 1:02am

Many Thanks Bitsmed and Kristen!

I will try both the solutions and let you know. I forget to mention that I will be running this on Teradata Database.

Kristen · August 24, 2015, 9:17am

Is Teradata mounted on top of MS SQL Server, or is it a completely different RDBMS?

(This is an MS SQL Server forum, so suggestions here may not work on other systems and folk here may not have knowledge of other systems syntax / tuning / etc.)

rahul · August 25, 2015, 4:06am

Hi Kristen,

Yes, it's a different RDBMS but I can check the plan for performance.

I have checked both the queries. The first query is working absolutely fine but has some performance issues in Teradata.

LEFT OUTER JOIN performs better but how can I eliminate the additional columns from processing table which are not matching. Also, I forgot to mention that a key column will be unique between the tables. so my query is as below .
insert into [processing table]
(
[Key column]
,Item_code
,Item_type
,....other misc columns
)
select [Key column]
,Item_code
,Item_type
,....other misc columns
FROM [MASTER TABLE] as S
LEFT OUTER JOIN [processing table] as D
ON D.KEY= S.KEY
AND mmyy_KEY= S.mmyy_KEY
---AND D.ITEM_CODE = S.Item_code (not required)
WHERE D.mmyy_KEY IS NULL -- t
AND S.mmyy_KEY=1408 -- This will be true for all rows where KEY doesn't exist.

I believe inner join is not required.But the problem is that the select is giving additional columns which are NULLS and insert is failing. Please suggest how can I suppress the additional columns from processing table so that the number of insert columns and select columns match.

Many thanks for your help!

rahul · August 25, 2015, 5:01am

Hi,

I think I got the solution. I can mention the column names explicitly in the select statement instead of using *.
Alternatively , is their any other JOIN condition which can be used.

Kristen · August 25, 2015, 7:39am

You can prefix the column names by the Alias for the table. Sorry, I should have done that; this is always necessary when you join the same table twice in a query (I would have expected SQL to raise an error though as the column names, without an alias, would be ambiguous?)

Alternative is to use MyAlias.* for "All Columns". This is normally frowned on but may actually be helpful in this case (and I, too, do do this when I am settings up a "Copy everything from A to B" scenario. One issue is that if a column is added to Source table then INSERT INTO TargetTable SELECT * FROM SourceTable will break (until the additional column is added to TargetTable), but if a change is made to Source Table that should also be made to Target Table, in this scenario, then perhaps no bad thing if it breaks if the change to Target Table is overlooked!

Revised code with aliases:

insert into [processing table]
(
[Key column]
...
)
select S.[Key column]
,S.Item_code
,S.Item_type
,....
FROM [MASTER TABLE] as S
LEFT OUTER JOIN ...
...

Or you could use

INSERT ...
SELECT S.*
FROM [MASTER TABLE] as S
LEFT OUTER JOIN ...
...