Databases Sync

I have an SSIS package that runs on a job to sync two databases (on different servers).

After the job is done, how can I verify that both databases are exactly the same? Is there a script or feature that I can run?

If they are not exactly the same, how can I get the differences (tables, fields, values)? And how can I get a message if a task failed during the package run?

Exactly the same in what sense?
Data or schema or both?

Both

For schema comparison, use sys.tables and sys.columns.
For data, look into the BINARY_CHECKSUM function:
https://docs.microsoft.com/en-us/sql/t-sql/functions/binary-checksum-transact-sql?view=sql-server-2017
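
For example, something along these lines - this is only a sketch, and names like [SourceServer], [SourceDb], and dbo.SomeTable are placeholders; it also assumes you have a linked server back to the source:

-- schema: columns that exist in the destination but not in the source (run the reverse as well)
SELECT t.name AS TableName, c.name AS ColumnName, c.system_type_id, c.max_length
FROM sys.tables AS t
JOIN sys.columns AS c ON c.object_id = t.object_id
EXCEPT
SELECT t.name, c.name, c.system_type_id, c.max_length
FROM [SourceServer].[SourceDb].sys.tables AS t
JOIN [SourceServer].[SourceDb].sys.columns AS c ON c.object_id = t.object_id;

-- data: compare a per-table checksum on each side
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM dbo.SomeTable;                           -- destination
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM [SourceServer].[SourceDb].dbo.SomeTable; -- source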

How often are you syncing?
Why can't you just use backup and restore from server A to server B?

Assuming the SSIS package has specific tables identified - and the process is performing a truncate on the destination and selecting all data from the source tables.

Is the purpose here to confirm the process worked - or do you not trust that the process is working as built? Not sure why you need to confirm anything if the process has been built correctly. This would be similar to asking: if I select all data from a table how can I be sure I have selected all data?

If your goal is to confirm the process ran successfully - then all you need to do is set up the agent job (assuming you are running this in SQL Server Agent) to notify on error. If there are any errors in the SSIS package, the agent job will fail - and you would be notified.

If you have deployed to the Integration Services Catalog - a report will be available that you can use to drill down into the specific task that failed and see the actual error.

If this process is not a full refresh - then you have to determine if the extract query is correct and has all of the necessary information available to update the destination. For example - can rows be removed/deleted from the source? If so - how are those rows identified to be removed in the destination?

If you are looking to validate the results one time - then you can use a data comparison tool (Redgate, Visual Studio, Idera, etc...). But it seems you want to validate the process every time...

Thanks Yosiasz.

Hi Jeff,

We are working with a bank and they give us two packages every quarter. One is for the initial run and the other is for the incremental run. From the look of the tasks in the incremental package, some tables are truncated but not all. Fast forward, we have gotten both packages to work and a job has been created for the incremental package. All is running, but now we move to the validation process.

Based on your previous message, I think I understand what you are trying to help me see regarding the data. So let's hold off on the data for a moment and talk about the process.

The purpose is both: "the job and the package have run successfully" and "the data are the same". I did set up a job (it runs every 3 hours), but where do I enter my email on the notification screen?

If some of the tasks failed in the package when it was run by the job, how do I get notifications for these errors? Is this the report that you are referring to?

"If you have deployed to the Integration Services Catalog - a report will be available that you can use to drill down into the specific task that failed and see the actual error."

Where can I find this report?

Thanks Jeff

The Notifications tab on the SQL Server Agent job. You have to create an Operator first - then you can select that Operator and identify when to send the notification.
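
If you prefer T-SQL, the same thing can be scripted roughly like this (the operator name, email address, and job name below are placeholders):

EXEC msdb.dbo.sp_add_operator
     @name = N'DBA Team',
     @enabled = 1,
     @email_address = N'dba@yourcompany.com';

EXEC msdb.dbo.sp_update_job
     @job_name = N'Incremental Sync',
     @notify_level_email = 2,                   -- 2 = notify on failure
     @notify_email_operator_name = N'DBA Team';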

In SSMS - in the Object Explorer there is a folder called Integration Services Catalogs. Open that folder and drill down to the project. Right-click on the project and select the 'All Executions' report from the Reports sub-menu.
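
If you want to pull the same information with a query instead of the report, something like this against the SSIS catalog should surface the error messages (this assumes the default SSISDB catalog name):

SELECT e.execution_id, e.package_name, m.message_time, m.message
FROM SSISDB.catalog.executions AS e
JOIN SSISDB.catalog.event_messages AS m ON m.operation_id = e.execution_id
WHERE m.message_type = 120   -- 120 = error
ORDER BY m.message_time DESC;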

Very Awesome Jeff! I was able to get both to work.

What is the difference between Email and Page for notifications?

https://docs.microsoft.com/en-us/sql/ssms/agent/job-properties-new-job-notifications-page?f1url=https%3A%2F%2Fmsdn.microsoft.com%2Fquery%2Fdev15.query%3FappId%3DDev15IDEF1%26l%3DEN-US%26k%3Dk(sql13.ag.job.notifications.f1);k(DevLang-TSQL)%26rd%3Dtrue&view=sql-server-2017

BTW... Jeff, is setting up the Operator the only way to receive email for an agent job (fail or succeed)? My predecessors said that they set up email notifications, but none were received. Today, I am the first to set up an Operator.

Yes - you have to have an operator in order to be able to set up notifications for an agent job. The drop-down on the notification tab only lists operators.

Note: you also have to have Database Mail set up and configured - and enabled as the notification system for SQL Server Agent.
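
If Database Mail has never been turned on for that instance, this is the piece that enables it (the mail profile/account and the Agent alert system setting still have to be configured separately - not shown here):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Database Mail XPs', 1;
RECONFIGURE;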

Hi Jeff, it has been almost 2 weeks now. I realized that the SSIS package that the vendor sent over to us is not going to work because they do not keep it consistent with their database. For some reason, the package is not producing the same number of records between the tables in the two databases.

I just need a mirror image of their database (schema, tables, and indexes) synced every hour. Is there an alternative solution to this?

Do you have any control over the SSIS package - can you modify it to meet your needs? If not - then there isn't anything you can do to get a copy of the data every hour unless they are willing to set up replication or an availability group with a read-only secondary.

Are you saying they are changing their schema - because that does not sound correct. I would not think any vendor would be making lots of changes to the tables (adding/removing columns) - or adding/removing indexes very often.

The idea behind having a copy of the data - is that you can then create the indexes you need for reporting and/or extract processes. The indexes you have would not necessarily be the same as what they have...

As for keeping the exact same number of rows - does the package use a filter to select the data? Are there purge processes to keep a set amount of data available? It can be normal to set up only xx years of data for reporting - it could also be normal not to include historical rows if the tables are set up that way...

It really depends on the data and the reporting requirements.

What happened to good old backup restore?

He needs a copy every hour


Correct, will need the sync every hour.

Hi Jeff, yes, I do have control over the SSIS package, and it has more than 100 Data Flow Tasks. Big. Most are repeated table processes. A generic task would be:

Source Table (OLE)
Delete Old Rows
Conditional Split
Destination Table (OLE)

The package that they provide to us has two issues.

First, some fields in the package do not match the fields in the tables, including data types. This has caused some issues. It seems that they have added/deleted fields in the database but did not update the package in a quarterly release. Not much can be done here except that, like you said, I have control over it.

Second, I was able to get the package to run and it shows successful, but when comparing the total record counts between tables, they are not the same. The scary thing here is that the package ran fine but we are missing records. In some cases, a large amount, and I wonder if the "conditional split" has something to do with this? You asked about a filter. No, I do not see a filter. The only WHERE clause that I see updates based on the latest (differences):

"with (NOLOCK) where modifiedutc > ? and modifiedutc <= ?"
the first parameter is "Last Modified" and the second parameter is the "Most Recent".
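
If I am reading the source component correctly, the command effectively boils down to something like this once the two parameters are supplied (the table name and the example values are made up):

DECLARE @LastModified datetime = '20240101 00:00';  -- example value for parameter 1, "Last Modified"
DECLARE @MostRecent   datetime = '20240101 03:00';  -- example value for parameter 2, "Most Recent"

SELECT *
FROM dbo.SomeSourceTable WITH (NOLOCK)
WHERE modifiedutc > @LastModified
  AND modifiedutc <= @MostRecent;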

If this is supposed to be a 1-1 sync, why would it need a "Conditional Split"?

It wouldn't need a conditional split - unless it is using an SCD (slowly changing dimension) to identify rows to be updated vs rows to be inserted.

The conditional split would redirect new rows to a SQL task to insert - and existing rows to a SQL task for update.
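
In plain T-SQL terms, that split-and-redirect pattern is roughly equivalent to this (table, staging, and key names are made up):

-- existing rows -> update
UPDATE d
SET d.SomeColumn = s.SomeColumn
FROM dbo.DestTable AS d
JOIN stg.SourceRows AS s ON s.KeyColumn = d.KeyColumn;

-- new rows -> insert
INSERT INTO dbo.DestTable (KeyColumn, SomeColumn)
SELECT s.KeyColumn, s.SomeColumn
FROM stg.SourceRows AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.DestTable AS d WHERE d.KeyColumn = s.KeyColumn);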

I am confused by your statement that it updates based on the latest... For a destination, you select the table. You can put a where clause on the query to select the data. If the source is an OLE DB Source - and it is using a SQL command with a where clause - then this task is NOT extracting all data from the source system. It is filtered based on the parameters provided.

If you are 'missing' rows, that could be caused by many issues: the parameters are not calculated correctly... the package skips a run so it is now missing an hour... the modifiedutc column is not populated when this runs... etc.

It does not sound like this is using an SCD because it is deleting old rows. So whatever condition is being applied for the conditional split is causing rows to be excluded, and that is definitely going to show up as fewer rows in the destination than in the source. Whether that is correct or not - depends on why the rows are excluded.

Hi Jeff,

I am not sure why they have conditional splits all over the tasks in their package. And you are correct with your third paragraph on "It is filtered based on the parameters provided". However, there are no actual "UPDATE" actions - only SELECT and DELETE.

I am going to look into the two areas that you mentioned: a missed job run and the conditional split.

Thanks Jeff.

Your expectation is that this process refreshes the table(s) completely - but the process has not been written that way. It appears that rows are excluded - either based on how the parameters are defined or what the conditional split process is actually doing...

Either way - validating by row count isn't going to work until you identify what rows are being selected for extract and actually inserted into the destination. Once you have determined that you can then validate that all expected rows have been inserted into the destination.
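
Once you know which rows should be in the destination, one way to check beyond a row count is a key-level comparison - something like this, assuming a linked server back to the source and made-up table/column names:

-- rows in the source that are not in the destination (swap the two selects for the reverse check)
SELECT KeyColumn, modifiedutc
FROM [SourceServer].[SourceDb].dbo.SomeTable
EXCEPT
SELECT KeyColumn, modifiedutc
FROM dbo.SomeTable;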

I totally agree with your assessment, Jeff. Like you said, with the conditional splits in here, expecting the tables to be exact matches is not feasible. I am researching those conditional splits and will update this thread. Thanks Jeff.

Hi Jeff,

I would like to set up a new SSIS package attached to a job (running every 30 minutes) to copy the differences between the tables (source and target). Each table does have a date field. There are about 300 tables in this database.

Is there a fast and convenient way to loop through all 300 tables? Or do I have to build out a task for each table?

As for the task itself, do you recommend setting up an OLE DB source and writing a SELECT statement like in my previous posts? If you have a sample package that you can share, I would really appreciate it, like always!
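
For the looping part, I was thinking of driving a Foreach Loop from a query like this, assuming every table exposes the same modifiedutc date column (which I still need to confirm):

-- list the user tables that have the modifiedutc column, one row per table for the loop to walk
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE EXISTS (SELECT 1
              FROM sys.columns AS c
              WHERE c.object_id = t.object_id
                AND c.name = 'modifiedutc')
ORDER BY s.name, t.name;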