SQLTeam.com | Weblogs | Forums

SSIS/Automate Process Advice


#1

Hi Gurus,

I need advice. I am working on a project to automate a manual process.

Here are the steps

  1. Attach the Oracle database.
  2. Load the data (about 20 tables) from Oracle into a SQL Server staging environment.
  3. Load the data from SQL staging into SQL production.

I have a rough idea and plan to automate steps 2 and 3.

My question is: is it possible to create an SSIS project to accomplish the above steps? I am going
to use SSIS for steps 2 and 3.

Also, is it possible to create a .BAT file so that when the user double-clicks it, a message or new
window pops up asking for:
Oracle Database Name = ?
SQL Server = ?
SQL Database = ?

The user will then enter the above info and hit SUBMIT, and behind the scenes it will
create the Oracle DB, load data from Oracle to SQL staging, and then from SQL staging to SQL prod.

Please, any advice would be greatly appreciated.

Thank You.


#2

What do you mean by "Attach Oracle database"? Normally, your SSIS packages would use an Oracle driver to get the tables and handle things in a simple dataflow. Alternatively, you could define the Oracle instance as a linked server on your SQL Server instance and read Oracle data from SQL Server.
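A minimal sketch of the linked-server route, assuming the OraOLEDB provider is installed; the server name, TNS alias, credentials, and table are all placeholders:

```sql
-- Define the Oracle instance as a linked server (names are illustrative).
EXEC sp_addlinkedserver
    @server     = N'MyOracleServer',
    @srvproduct = N'Oracle',
    @provider   = N'OraOLEDB.Oracle',
    @datasrc    = N'ORCL';            -- TNS alias of the Oracle instance

-- Map a SQL Server login to an Oracle account (placeholder credentials).
EXEC sp_addlinkedsrvlogin
    @rmtsrvname  = N'MyOracleServer',
    @useself     = 'FALSE',
    @rmtuser     = N'oracle_user',
    @rmtpassword = N'oracle_password';

-- Read Oracle data from SQL Server; the query inside OPENQUERY
-- is executed on the Oracle side.
SELECT *
FROM OPENQUERY(MyOracleServer, 'SELECT * FROM SCOTT.EMP');
```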


#3

I don't use SSIS, but maybe you could. We want a high level of control over any goofy data - nice log tables with user-friendly messages, which the user can review to sort out any problems (we don't want that to be our job ...). Maybe you can do all that in SSIS, but back when I was a boy :) you certainly could NOT, and clients I worked for spent several hours of every day sorting out the SSIS (or whatever it was called back then ... DTS maybe??) jobs that had failed overnight.

For us:

We expect the remote DB's tables to have a Change Date/Time column. Of course not everything has that, and where it is not available the process takes MUCH longer.

We pull - using OPENQUERY(MyOracleServer, 'SELECT ... WHERE RemoteChangeDate > LastTime') - all the changed rows into a ##TEMP table. We then delete any that exist in the STAGING table (based on PKey), and then insert the lot with two additional columns: ImportedDateTime (local time) and Status=1 (row exists on remote; other values are 0 = no longer exists on remote and 2 = exists on remote, but reprocess).
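That pull might look roughly like this - a sketch only, with assumed table and column names (Staging_MyTable, PKeyCol, etc.) and a hard-coded cutoff date standing in for "LastTime":

```sql
-- Pull changed rows from Oracle into a global temp table.
SELECT *
INTO ##Changed
FROM OPENQUERY(MyOracleServer,
    'SELECT * FROM REMOTE_SCHEMA.MY_TABLE
     WHERE CHANGE_DATE > TO_DATE(''2024-01-01 00:00:00'', ''YYYY-MM-DD HH24:MI:SS'')');

-- Remove any staging rows that are about to be replaced.
DELETE S
FROM dbo.Staging_MyTable AS S
JOIN ##Changed AS C ON C.PKeyCol = S.PKeyCol;

-- Insert the fresh copies, stamping local import time and status.
INSERT INTO dbo.Staging_MyTable (PKeyCol, Col1, Col2, ImportedDateTime, Status)
SELECT PKeyCol, Col1, Col2, GETDATE(), 1    -- 1 = row exists on remote
FROM ##Changed;
```

Note that OPENQUERY only accepts a literal string, so in practice the "LastTime" value has to be spliced into the query text with dynamic SQL (build the statement into a variable and EXEC it).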

We then pull all PKey values (only) from the remote (into a different ##TEMP table), and any in the local table NOT in the remote (and not already set to Status=0) we change to Status=0, setting ImportedDateTime to local NOW time. (These are recent deletions from the remote; if the remote uses a flag instead of a physical delete this isn't necessary of course - although it might still be desirable in case of a RESTORE of the remote DB.)
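The deletion-detection step could be sketched like this (same assumed names as above):

```sql
-- Pull just the primary keys that still exist on the remote.
SELECT *
INTO ##RemoteKeys
FROM OPENQUERY(MyOracleServer, 'SELECT PKEY_COL FROM REMOTE_SCHEMA.MY_TABLE');

-- Flag staging rows whose key has vanished from the remote.
UPDATE S
SET Status = 0,                     -- 0 = no longer exists on remote
    ImportedDateTime = GETDATE()
FROM dbo.Staging_MyTable AS S
WHERE S.Status <> 0
  AND NOT EXISTS (SELECT 1 FROM ##RemoteKeys AS R WHERE R.PKEY_COL = S.PKeyCol);
```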

We then UpSert from the staging table into production. We do this using a VIEW onto the STAGING table which "maps" the columns to the ones in our production table and does all "straightforward" data manipulation. It's easy to change the VIEW if some additional twiddling is needed; more effort is needed to change the UpSert sproc (same logic, but more testing effort IME). It also makes it easier to mechanically generate the UpSert sproc - if it contains no bespoke code.
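Such a mapping view might look like this - purely illustrative names, with the sort of "straightforward" manipulation described (trimming, string-to-date conversion):

```sql
-- Maps staging columns onto production names and does the simple
-- data manipulation, so the UpSert sproc itself stays generic.
CREATE VIEW dbo.vStaging_MyTable_ForUpsert
AS
SELECT  PKeyCol,
        LTRIM(RTRIM(Col1))               AS ProdCol1,
        TRY_CONVERT(date, RemoteDateStr) AS ProdDate,  -- remote stores dates as strings
        ImportedDateTime,
        Status
FROM dbo.Staging_MyTable;
```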

The UpSert sprocs UPDATE any row where the staging table's ImportedDateTime is more recent than "last time", INSERT all rows that are new (PKey exists in STAGING but not in Production, and Status>=1), and then DELETE anything from Production where the PKey in staging has Status=0 [or flag the record as deleted, instead, if that is how the local table works].
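The skeleton of such an UpSert sproc, assuming a mapping view as described above (all object names, columns, and the @LastTime parameter are placeholders):

```sql
-- @LastTime = the stored watermark from the previous run (assumed parameter).
DECLARE @LastTime datetime = '2024-01-01';

-- UPDATE rows imported since the last run.
UPDATE P
SET P.Col1 = V.ProdCol1, P.SomeDate = V.ProdDate
FROM dbo.Prod_MyTable AS P
JOIN dbo.vStaging_MyTable_ForUpsert AS V ON V.PKeyCol = P.PKeyCol
WHERE V.ImportedDateTime > @LastTime
  AND V.Status >= 1;

-- INSERT rows that are new to production.
INSERT INTO dbo.Prod_MyTable (PKeyCol, Col1, SomeDate)
SELECT V.PKeyCol, V.ProdCol1, V.ProdDate
FROM dbo.vStaging_MyTable_ForUpsert AS V
WHERE V.Status >= 1
  AND NOT EXISTS (SELECT 1 FROM dbo.Prod_MyTable AS P WHERE P.PKeyCol = V.PKeyCol);

-- DELETE (or flag as deleted) rows removed at the remote.
DELETE P
FROM dbo.Prod_MyTable AS P
JOIN dbo.vStaging_MyTable_ForUpsert AS V ON V.PKeyCol = P.PKeyCol
WHERE V.Status = 0;
```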

We store the "last time" value in the Production Database - so if Production is restored the Import resumes from the correct (earlier) time. If STAGING is restored then, at worst, we will get some rows coming through again - the UpSert from STAGING to PRODUCTION is based on the local ImportedDateTime, not the remote table's ChangeDate, as we prefer that it is under our control.

The value used to update "last time" is the MAX ImportedDateTime of the staging table at the start of the run. This handles an Async refresh of Staging data WHILST the UpSert is running (Grrrrrr!)
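Capturing the watermark before the UpSert starts, so async refreshes mid-run are not lost, might look like this (ImportControl is an assumed bookkeeping table):

```sql
-- Capture the watermark BEFORE processing: any rows that arrive in
-- staging while the UpSert runs will be picked up on the next run.
DECLARE @ThisRunHighWater datetime;

SELECT @ThisRunHighWater = MAX(ImportedDateTime)
FROM dbo.Staging_MyTable;

-- ... run the UpSert here, using the previously stored "last time" ...

-- Then persist the new watermark in the PRODUCTION database, so a
-- restore of production resumes the import from the correct point.
UPDATE dbo.ImportControl
SET LastTime = @ThisRunHighWater
WHERE TableName = 'MyTable';
```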

Our UPDATE of Production does a full comparison of every column (including detecting NULL in one or the other, and CASE SENSITIVE changes to data, along with different numbers of trailing spaces), and only updates rows that have actually changed (i.e. in a column that we are interested in). This may not be necessary for you, but we have triggers on the Production tables which we don't want to fire if there is no substantive change to the data.
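For a single column, that change-detection predicate could be sketched as below (one illustrative column; a real generator would emit this for every column of interest). Note that plain `=` in T-SQL ignores trailing spaces and uses the column's default collation, hence the extra clauses:

```sql
-- Update only rows with a substantive change in Col1. The predicate
-- treats NULL-vs-value, case differences, and differing numbers of
-- trailing spaces as changes; identical rows are left alone so that
-- production triggers do not fire needlessly.
UPDATE P
SET P.Col1 = S.Col1
FROM dbo.Prod_MyTable AS P
JOIN dbo.Staging_MyTable AS S ON S.PKeyCol = P.PKeyCol
WHERE   P.Col1 COLLATE Latin1_General_CS_AS
            <> S.Col1 COLLATE Latin1_General_CS_AS    -- case-sensitive compare
     OR (P.Col1 IS NULL AND S.Col1 IS NOT NULL)       -- NULL on one side only
     OR (P.Col1 IS NOT NULL AND S.Col1 IS NULL)
     OR DATALENGTH(P.Col1) <> DATALENGTH(S.Col1);     -- trailing-space difference
```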

Our staging tables are identical mirrors of the remote (no data manipulation between remote and staging), so that if we find that our UpSert process has an error (likely in the early stages, at least!!) we can fix it and just re-run it against staging - no need to pull fresh data from the remote, as that tends to be a slow process and can also have a severe impact on the performance of the remote.

But we only include columns in STAGING that we actually need - or think that we MIGHT need!! If we find we need more columns we re-populate just those additional columns (in STAGING) with a manual UPDATE rather than forcing all rows, and all columns, to "pull" from the remote - i.e. that update does not change the local ImportedDateTime on the staging table rows. We want that to continue to reflect when the row was originally acquired, because of a ChangeDate at the remote, not the fact that we had to top up some columns.

Note that if a row changes at the remote, and we pull it based on newer ChangeDate, we do not care if all the columns happen to be the same as we already had, we replace the row in STAGING and set its ImportedDateTime to now. One of the support questions that comes up is "when did this row change", so controlling the setting of ImportedDateTime is useful in that regard.

If it is important to know about changes to STAGING, I would put a trigger on it and insert any changes into an Audit table.

If there is NO ChangeDate on remote we pull All Rows, Every Time, and do an Every-Column-Comparison to then only replace changed rows in STAGING (the new-row-insert and deleted-row-status-change is the same). This is horrendously slow, especially for large tables. The only way we have found to improve this is to put a "staging database" near the Remote server. Update the Staging Tables in that DB, based on all-rows-all-columns-comparison, but over LAN rather than WAN, and then pull from that Staging Table into our local Staging Table using our own "change date" column.

All our code for the pull from remote to staging, and the UpSert from staging to production (including the VIEW), is mechanically generated. We have evolved a template over the years, and just apply it (with some crafty code :) ) to generate the code for the process. That is driven by a list of the columns - and their data types and sizes - that we want to a) pull from the remote and b) UpSert into production.

Our UpSert process has loads of goofy-data checks. For example, every single VARCHAR column is checked for length (against the actual max column size in Production). All VARCHAR columns in our staging tables are declared as VARCHAR(8000) (or (MAX), of course), regardless of the actual size at the remote. This allows the remote to change sizes over time without us having to make a change at our end. But if the remote data arriving suddenly becomes bigger than our Production column we a) want to know and b) want to handle it gracefully. We are always the last to know about a schema change on the remote ...

Our code generator allows for either "Critical error" or "Truncate and warn" on each VARCHAR column. All rows being processed by the UpSert have Error Number and Error Message columns (in the working #TEMP table) that are set for any over-size test etc., and also for any additional data-validation checks - e.g. where the remote has a date as a string but we are converting it to a Date datatype, we want to check that the remote string date is actually a valid date!!

After the validation steps are completed, all messages are copied to a Record Validation Errors table (existing matching errors, for that Table + PKey, have their UpdateDate reset to "now" and are set to Active; new errors are inserted; any other errors, for this process, that are no longer in the batch [for the PKeys actually included] are set to Inactive - i.e. the data has been fixed for those).

For any potentially recoverable errors (e.g. Foreign Key parent not found) the staging row is set to Status=2 (retry next time). Otherwise the failed row will not be retried next time, because its remote ChangeDate has not changed. Let's say that our UpSert has not created a parent row because of some simple data validation issue, and referential integrity in our APP has then prevented the child table rows being inserted too. If the parent is then, subsequently, fixed on the remote it will import (ChangeDate is newer), but the child rows will not have changed. Thus Status=2 in staging takes care of that. Each time rows from staging are processed, any Status=2 is reset to Status=1 (and then back to Status=2 if the error condition persists).
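A per-column length check of the kind described might be sketched like this (the working table #UpsertWork, the error code, and the column are all assumptions; a real generator would read @MaxLen from the production column's definition):

```sql
-- Maximum length of the corresponding production column (assumed VARCHAR(50)).
DECLARE @MaxLen int = 50;

-- Record the validation message on any over-size row.
UPDATE T
SET ErrorNumber  = 101,             -- illustrative error code
    ErrorMessage = 'Col1 longer than ' + CONVERT(varchar(10), @MaxLen)
                 + ' characters; value truncated'
FROM #UpsertWork AS T
WHERE LEN(T.Col1) > @MaxLen;

-- For "Truncate and warn" columns, trim the value but keep the row.
-- (A "Critical error" column would instead exclude the row from the UpSert.)
UPDATE #UpsertWork
SET Col1 = LEFT(Col1, @MaxLen)
WHERE LEN(Col1) > @MaxLen;
```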

Blimey, seems like I had rather a lot to say on the matter ...


#4

why do you want to bring the data to SQL Server? what is the end game here?


#5

Thanks "Gbritton/Kristen and Yosiasz" for your comments.

Kristen, I will take a look and try to understand how I can use this logic in my environment.

I received an Oracle .bak or dump file from the client. So I have to create a new database in Oracle and then load the data from
Oracle to SQL Server.

Yosiasz, the end game here is to bring the data from Oracle into SQL Server.