Data Integration System

Dclipse03 · June 27, 2017, 3:55pm

I am currently working with a system that has been developed to consolidate data from 110 databases throughout the state. The data is consolidated from 110 individual schemas that reside in a replication oracle database schema. Data is retrieved nightly from these databases and repopulated into the respective schema. From there that data is then consolidated into a separate schema for statistics purposes.

The 110 databases throughout the state are currently all oracle databases and I am trying to convert the centralized server from oracle to sql server.

I have been looking into the different replication methods: Snapshot, Transactional, Merge, Change Data Capture and the possible ways that I could go about replicating the data (example: linked servers, database mirroring)

Would anyone have any suggestions as to what approach would be the best and the easiest? If I am just retrieving data one way from the oracle databases to the sql server, would change data capture be best to insert the data into the respective schema for that but then use database mirroring to insert the data into the statistics schema?

Kristen · June 27, 2017, 5:20pm

You can have a Linked Server and "pull" the data from the remote (Oracle) server. Provided you don't JOIN that "pull" with local tables it will be efficient. Pull it into a local Staging Table and THEN join THAT to other tables if you need to.

If the Oracle end has a ChangeDate such that you can pull ONLY Changed Records then you can use that to your advantage (you need to allow for deleted rows though - we find that pulling ALL the PKeys and comparing them against local, and deleting local records which are no longer on Remote, COMBINED with pulling only rows New/Changed-since-last-time is very efficient).

We use

INSERT INTO MyLocalStagingTable
SELECT *
FROM OPENQUERY(LinkedRemoteServer, 'SELECT * FROM RemoteDB.RemoteTable WHERE RemoteChangedDate > LastTime')

(Yeah, we don't use "*" and we include some more sophisticated WHERE clause stuff for the cutoff)

Just don't do

SELECT Col1, Col2,
FROM LinkedRemoteServer.RemoteDB..RemoteVeryBigTable
    JOIN LocalVerySmallTable
        ON ...

because there is a risk that SQL Server will pull ALL the rows from RemoteVeryBigTable local and THEN join them locally and THEN throw away most of the rows pulled from RemoteVeryBigTable

So we just keep a local StagingTable in sync (i.e. once a day, whatever) with the Remote and then further-process from that.

You can use SSIS to do all this. We don't because a-long-time-ago, when it was called DTS rather than SSIS, we thought that it needed a lot of support when "things went wrong" - e.g. data arrived with invalid dates, goofy stuff, JOIN to local tables failed, FKey issues. Maybe SSIS nowadays has solutions to all those problems.

Our routines that MERGE data from StagingTables into OurLiveTables has yards and yards of safety-net stuff built in (all of it mechanically generated). We check the size of every single column that we are importing to make sure it is not wider than our local table. If the remote database schema changes we will be the LAST to know and I presume that SSIS would just blow up. Our routine will log it as an error, truncate it (if that is safe) or treat it as a critical error and not import that row. End user sees the logs, and most things (goofy data) they can sort out without our help. Plus they are aware which rows have gone wrong, whereas an SSIS job that truncated something may never tell the user at all ... so that then relies on Someone Noticing and Someone Shouting. bad outcome IMHO.

For systems where the Remote DB tables have NO Change Dates and the WAN is "relatively slow" and REMOTE tables are HUGE then we do this:

Set up a staging server "near" the Remote Server. On the same LAN, preferable nearby on the rack!

We do all the above to get data from the Remote DB and freshen up our Staging Table, but we have to Pull, and then Compare, ALL rows and ALL columns in the table(s). Hence we want our Staging Server NEAR the Source data.

Our staging table has a ChangeDate, so the far end CAN then just pull "changed rows" to its own staging table. I wish we didn't have to have the expense of the extra server, but its a lot better than pulling billions of rows across the WAN every night, just to find the handful that have actually changed.

last thing. Everywhere where you reference a remote database (i.e. same server) or a remote server - which in SQL Server is referred to as 3-part-naming (SomeDatabase.SomeSchema.SomeTable) or 4-part (RemoteServer.RemoteDatabase.RemoteSchema.RemoteTable) then set up a SYNONYM. Changing all the 3-part and 4-part naming in your code is a Big Job. Changing a Synonym is a 30-second job (e.g. when a table moves databases, or a database moves servers, or you have finished testing on DEV and now need to move to PRODUCTION)

If replication will work that sounds great. But presumably it will replicate every change, and that can be quite a large burden - although maybe that only matters if a record changes several times in a day. Pulling the data overnight a) only pulls that changed row once and b) does all the heavy-lifting during night-hours

Dclipse03 · June 28, 2017, 8:18pm

Many thanks Kristen! I appreciate you taking the time to advise which route I should take. This has helped me tremendously!