Compare Tables in SSIS

How would I compare a vendor table to our internal staging table and get the exception records in SSIS? The tables have exactly the same structure and schema.

In SQL, I would use an EXCEPT with a linked server, but I am trying to avoid the linked server. How would I do this in SSIS between two data connections (vendor & our staging database)?
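For reference, the linked-server EXCEPT approach being avoided would look something like this (the server, database, and column names here are all placeholders, not the real ones):

```sql
-- Vendor rows that are missing or different in staging
-- (four-part linked-server name; all names are hypothetical)
SELECT Key1, Col1, Col2
FROM [VendorServer].VendorDb.dbo.TableA
EXCEPT
SELECT Key1, Col1, Col2
FROM StagingDb.dbo.TableA;
```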

In what format are you receiving the vendor table? If you can import it into SQL Server and do the comparison in a query, that would be the fastest. Comparisons from a linked server for tables with anything other than a few rows perform poorly.

I am using a SQL Authentication connection to the vendor's database, JoTorre. Their database and server are read-only to us, so I cannot create a stored procedure on their server to push data to our server. If I create a stored procedure on my server, then of course I would have to use a linked server.

I have a feeling this will be faster in SSIS, which avoids the linked server.

So how would I do this? Let's say the connection to the vendor is called Source while our connection is called Destination. Both the Source and Destination connections have a table called "Table A" with a primary key called "Key1".

I would like to compare Source Table A to Destination Table A. If there are new records, then insert them into Destination Table A. If there are changes to existing records, then update those records. If records in Destination Table A no longer exist in Source, then delete them.

Any links to tutorials would be appreciated.

In SSIS, set up 2 connections:
1 to your server
1 to their server

Dump the data from each into your staging database as 2 tables; name them something like compare.vendor and compare.babyaqua. Run and test to see that the data dumps correctly. Then, on staging, use SQL or a stored procedure to compare the differences.
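Once both dumps land in staging, the comparison can run locally. A minimal sketch, assuming the two dump tables share the same columns:

```sql
-- Vendor rows not present (or different) in our copy
SELECT * FROM compare.vendor
EXCEPT
SELECT * FROM compare.babyaqua;

-- Our rows no longer present in the vendor data
SELECT * FROM compare.babyaqua
EXCEPT
SELECT * FROM compare.vendor;
```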

Are there updated-date and unique identifier columns in the tables you're comparing?


JoTorre, yes, there is one field that is an identifier, and in some cases there are two fields that together form the identifier.

Yosia, I got the two connections established, but how do I dump the data to staging? Regardless, that is a lot of data to dump, more than a million records.

What is your end game? What are you trying to accomplish?

I think you are trying to create a whole sync ecosystem. It would be very difficult to detail every single step you need to do, especially when dealing with millions of rows, but here is a shortened version:

  1. Create 2 dump tables.
  2. On your SSIS package canvas, create a Data Flow Task (go to the menu, choose SSIS, and select SSIS Toolbox).
  3. In this Data Flow Task (double-click to go into it):
    a. Create a source (from the SSIS Toolbox); double-click it, point it to the vendor connection, and select the table.
    b. Create a destination pointing to staging (from the SSIS Toolbox); double-click it, point it to the staging connection, and select the table.

As far as the millions of records go: if you need to do a comparison, you will have to decide what subset of data you would like pulled, maybe a parameter that filters the rows from the vendor, etc.

Are you able to add a logging table on the vendor db to keep track of changed rows, with the PK and a changed-date audit column?

Not sure I am following what you are trying to accomplish - and whether or not your 'staging' table is actually being utilized the way I think.

If the goal is to extract data from your source system - and only insert/update changed data into your destination - then there are a couple of options:

  1. Extract from source - load to staging table on destination - MERGE to target
  2. Extract from source - load to staging table on destination - UPSERT to target
  3. Extract from source - use SCD transformation in SSIS to INSERT/UPDATE to target

For Items 1 and 2 - truncate the staging table - use a simple data flow from the source system to the destination system into a staging table. Use an Execute SQL Task to execute a stored procedure on the destination system that performs either a MERGE or UPSERT (INSERT/UPDATE) to the final target table.

For Item 3 - use the SCD transformation: define the source table/query, the destination table, and the business keys, and SSIS will build the data flow as insert/update operations based on the existence of those business keys in the destination.

For any of these - you need to evaluate whether or not you want to update the data that already exists - and what determines that the data should be updated. For example - for MERGE you would add criteria on the WHEN MATCHED:

    target.column1 <> source.column1
 OR target.column2 <> source.column2
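Put together, a MERGE along those lines might look like this (the table and column names are hypothetical; the WHEN NOT MATCHED BY SOURCE clause handles the deletes asked about earlier):

```sql
MERGE dbo.TableA AS target
USING staging.TableA AS source
    ON target.Key1 = source.Key1
WHEN MATCHED AND (   target.column1 <> source.column1
                  OR target.column2 <> source.column2)
    THEN UPDATE SET target.column1 = source.column1,
                    target.column2 = source.column2
WHEN NOT MATCHED BY TARGET
    THEN INSERT (Key1, column1, column2)
         VALUES (source.Key1, source.column1, source.column2)
WHEN NOT MATCHED BY SOURCE
    THEN DELETE;
```

One caveat: `<>` evaluates to UNKNOWN when either side is NULL, so nullable columns need extra handling (ISNULL wrappers, or an EXCEPT-based comparison) or changed rows with NULLs will be skipped.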

For an UPSERT, the update portion:

    UPDATE t
    SET ...
    FROM target t
    JOIN source s ON s.key = t.key
    WHERE t.column1 <> s.column1
       OR t.column2 <> s.column2

And the insert portion:

    INSERT INTO target (...)
    SELECT ...
    FROM source s
    WHERE NOT EXISTS (SELECT * FROM target t WHERE t.key = s.key)

For the UPSERT method - I would recommend putting the whole thing inside a TRY/CATCH block with error checking to abort the transaction if either the update or the insert fails.
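A sketch of that wrapper, with hypothetical table and column names standing in for the real ones:

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Update rows that exist in both tables but differ
    UPDATE t
       SET t.column1 = s.column1
      FROM dbo.TableA AS t
      JOIN staging.TableA AS s ON s.Key1 = t.Key1
     WHERE t.column1 <> s.column1;

    -- Insert rows that exist only in the staged source
    INSERT INTO dbo.TableA (Key1, column1)
    SELECT s.Key1, s.column1
      FROM staging.TableA AS s
     WHERE NOT EXISTS (SELECT 1 FROM dbo.TableA AS t WHERE t.Key1 = s.Key1);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;  -- abort the whole upsert on any failure
    THROW;                     -- re-raise the original error to the caller
END CATCH;
```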

Extracting a million rows from the source shouldn't be an issue - unless there is a performance problem pulling that much data across the network. If that is the case, it doesn't matter which method you use - you are going to have the same performance issue.

If you have a unique id and an updated datetime column, just pull the rows from the source with an updated datetime greater than the max updated datetime in your local staging table to process inserts and updates. Do the deletes using a cursor to delete each row where the id doesn't exist in the source table. This way you minimize the data pulled across the network.
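A rough sketch of that incremental pull, assuming a hypothetical UpdatedAt audit column; the SELECT would run against the vendor connection as the source query in the data flow:

```sql
-- Watermark computed on the destination side first
DECLARE @maxUpdated datetime =
    (SELECT MAX(UpdatedAt) FROM staging.TableA);

-- Source query against the vendor table (names are placeholders)
SELECT Key1, column1, UpdatedAt
  FROM dbo.TableA
 WHERE UpdatedAt > @maxUpdated;
```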
