Compare Tables in SSIS

How would I compare a Vendor table to our internal staging table and get the exception records in SSIS? The tables have the exact same structure and schema.

In SQL, I would use an EXCEPT with a linked server, but I am trying to avoid the linked server. How would I do this in SSIS between two data connections (vendor & our staging database)?
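
For reference, the linked-server version I am trying to avoid looks something like this ([VENDORSRV] and the column list are just placeholders):

SELECT Key1, Column1, Column2
FROM [VENDORSRV].VendorDb.dbo.[Table A]
EXCEPT
SELECT Key1, Column1, Column2
FROM dbo.[Table A];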

In what format are you receiving the vendor table? If you can import it into SQL Server and do the comparison in a query, that would be the fastest. Comparisons from a linked server for tables with anything other than a few rows perform poorly.

JoTorre, I am using a SQL Authentication connection to the vendor's database. Their database and server are read-only to us, so I cannot create a stored procedure on their server to push data to our server. If I create a stored procedure on my server, then of course I would have to use a linked server.

I have a feeling this will be faster in SSIS, avoiding the linked server.

So how would I do this? Let's say the connection to the vendor is called Source while our connection is called Destination. Both Source and Destination have a table called "Table A" with a primary key called "Key1".

I would like to compare Source Table A to Destination Table A. If there are new records, then insert them into Destination Table A. If there are changes to existing records, then update those records. If records in Destination Table A no longer exist in Source, then delete them.

Any links to tutorials would be appreciated.

In SSIS, 2 connections:
1 to your server
1 to their server

Dump the data from each to your staging database as 2 tables. Name the tables something like compare.vendor and compare.babyaqua. Run and test to see that the data dumps correctly. Then, on staging, use SQL or a sproc to compare the differences, as sketched below.
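
Once both dumps land, the comparison on staging is a single query. A rough sketch, assuming both tables share the key Key1 and a comparable column Column1 (names are placeholders):

SELECT v.Key1 AS vendor_key, b.Key1 AS internal_key
FROM compare.vendor AS v
FULL OUTER JOIN compare.babyaqua AS b
    ON b.Key1 = v.Key1
WHERE b.Key1 IS NULL             -- new at the vendor
   OR v.Key1 IS NULL             -- deleted at the vendor
   OR v.Column1 <> b.Column1;    -- changed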

Are there updated-date and unique-identifier columns in the tables you're comparing?


JoTorre, yes, there is one field that is the identifier, and in some cases there are two fields that together form the identifier.

Yosia, I got the two connections established, but how do I dump the data to staging? Regardless, that is a lot of data to dump, more than a million records.

What is your end game? What are you trying to accomplish?

I think you are trying to create a whole sync ecosystem. It would be very difficult to detail every single step you need to do, especially when dealing with millions of rows, but here is a shortened version:

  1. Create 2 dump tables in staging (see the sketch after this list)
  2. On your SSIS package canvas, create a Data Flow Task (go to the menu, choose SSIS, and select Toolbox)
  3. In this Data Flow Task (double-click to go into it):
    a. Create a source and point it to your vendor (SSIS Toolbox): double-click it, point it to the vendor connection, and select the table
    b. Create a destination for the staging table (SSIS Toolbox): double-click it, point it to the staging connection, and select the dump table
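
For step 1, a minimal sketch of the dump tables, assuming a hypothetical Key1/Column1/Column2 layout for "Table A" (swap in the real columns):

CREATE SCHEMA compare;
GO

CREATE TABLE compare.vendor (
    Key1    int           NOT NULL PRIMARY KEY,
    Column1 nvarchar(100) NULL,
    Column2 nvarchar(100) NULL
);

CREATE TABLE compare.babyaqua (
    Key1    int           NOT NULL PRIMARY KEY,
    Column1 nvarchar(100) NULL,
    Column2 nvarchar(100) NULL
);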

As for the millions of records: if you need to do a comparison, you will have to decide what subset of data you would like pulled, maybe a parameter that filters the rows from the vendor, etc.
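
For example, instead of pulling the whole table, the OLE DB Source can run a parameterized query. A sketch, assuming a hypothetical LastUpdated audit column on the vendor table (the ? is mapped to an SSIS variable in the source editor):

SELECT Key1, Column1, Column2, LastUpdated
FROM dbo.[Table A]
WHERE LastUpdated > ?;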

Are you able to add a logging table on the vendor DB to keep track of changed rows, with the PK and a changed-date audit column?
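
If so, even a minimal change-log table would be enough to drive incremental pulls; the names here are purely illustrative:

CREATE TABLE dbo.TableA_ChangeLog (
    Key1       int       NOT NULL,   -- PK of the changed row
    ChangeType char(1)   NOT NULL,   -- 'I' = insert, 'U' = update, 'D' = delete
    ChangedAt  datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);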

Not sure I am following what you are trying to accomplish - and whether or not your 'staging' table is actually being utilized the way I think.

If the goal is to extract data from your source system - and only insert/update changed data into your destination - then there are a few options:

  1. Extract from source - load to staging table on destination - MERGE to target
  2. Extract from source - load to staging table on destination - UPSERT to target
  3. Extract from source - use SCD transformation in SSIS to INSERT/UPDATE to target

For Items 1 and 2 - truncate the staging table - use a simple data flow from the source system to the destination system into a staging table. Use an Execute SQL Task to execute a stored procedure on the destination system that performs either a MERGE or UPSERT (INSERT/UPDATE) to the final target table.

For Item 3 - use the SCD transformation: define the source table/query, the destination table, and the business keys, and SSIS will build the data flow as insert/update operations based on the existence of those business keys in the destination.

For any of these - you need to evaluate whether or not you want to update the data that already exists - and what determines that the data should be updated. For example - for MERGE you would add criteria on the WHEN MATCHED:

MERGE target
USING source
    ON source.[key] = target.[key]
WHEN MATCHED
AND (
    target.column1 <> source.column1
 OR target.column2 <> source.column2
)
THEN UPDATE SET
    target.column1 = source.column1,
    target.column2 = source.column2;

For an UPSERT:

BEGIN TRANSACTION;

UPDATE t
SET t.column1 = s.column1,
    t.column2 = s.column2
FROM target t
JOIN source s ON s.[key] = t.[key]
WHERE t.column1 <> s.column1
   OR t.column2 <> s.column2;

INSERT INTO target ([key], column1, column2)
SELECT s.[key], s.column1, s.column2
FROM source s
WHERE NOT EXISTS (SELECT * FROM target t WHERE t.[key] = s.[key]);

COMMIT TRANSACTION;

For the UPSERT method - I would recommend putting the whole thing inside a TRY/CATCH block, with error checking to abort the transaction if either the update or insert fails.
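
Something along these lines - the commented slots are the UPDATE and INSERT statements from the UPSERT above:

BEGIN TRY
    BEGIN TRANSACTION;

    -- UPDATE statement from the UPSERT above goes here
    -- INSERT statement from the UPSERT above goes here

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- undo any partial work, then rethrow so the caller sees the failure
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;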

Extracting a million rows from the source shouldn't be an issue - unless there is a performance problem pulling that much data across the network. If that is the case, it doesn't matter which method you choose - you are going to have the same performance issue.

If you have a unique ID and an updated-datetime column, just pull the rows from the source with an updated datetime greater than the max updated datetime in your local staging table to process the inserts and updates. Then do the deletes using a cursor that deletes each staging row whose ID no longer exists in the source table. This way you minimize the data pulled across the network.
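
A rough sketch of that incremental pull, assuming hypothetical Key1 and UpdatedAt columns (in SSIS, step 1 would be an Execute SQL Task whose result feeds the source query parameter):

-- 1) on your server: find the high-water mark
SELECT MAX(UpdatedAt) AS max_updated
FROM staging.[Table A];

-- 2) on the vendor server: pull only rows changed since then
SELECT Key1, Column1, Column2, UpdatedAt
FROM dbo.[Table A]
WHERE UpdatedAt > ?;  -- ? mapped to max_updated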
