Periodically copy data from one database to another in sql server 2014

Anjitha · January 18, 2017, 5:12pm

I am working on a project which involves a client database to which we have read-only access.
That database contains around 7500 tables of which we need to access around 200 tables with millions of rows in each of them. Data needs to be copied to a local database for using with our project. The main database does not have primary keys and most tables don't have datetime type columns to help in filtering.
There will be only insertions and updations in the main database, no deletions.
The requirement is to periodically update our local database with the records inserted/updated after last update.

One way which I found was
insert into 'OurDatabase.Table'
select * from 'MainDatabase.Table' where condition
But my doubts are

How the condition must be written so as to capture last inserted/updated records without datetime columns
Whether the updation will take too much time if many transactions has occurred between updations

I checked Change Tracking but since there main database has no primarykeys and there is no permission to modify main database , had to drop that option

What other alternatives are possible to update database with latest records

jotorre_riversidedpss.org · January 18, 2017, 6:51pm

without a last updated timestamp it will be impossible without truncate and insert. Try replication, although that sill requires modification on the source database.

Anjitha · January 18, 2017, 7:06pm

Ok will try that

Thanks

Kristen · January 19, 2017, 9:34am

Option 1: Run away!

If there is a UNIQUE identifier in the table you could use that as your filter for "what is new", but it won't help with "what is changed". If, perchance, you need new rows more urgently than you need changed row then that might help.

You can use a Check Sum to figure out which rows have changed (more easily than comparing every single column), but Check Sum has a tiny chance of giving a false negative

I take a very simple view in these circumstances. I require that our end knows when the row was changed. It always comes up in debate when the client is pointing fingers, and we have always saved our own time, and face, by being able to say "This last changed ON xxx" and thereby diagnosing where the problem stemmed from ... if the other party has no ChangeDate and no PKeys then they have JackShit idea of what changed / when. So we have a DateTime column for "We last received a MODIFIED version of this row ON". We use a staging table for this, and then further process from the staging table into our own tables. (So staging table is an exact replicate of the remote, but only in columns that are of interest to us (which might be all of them ...) (**)

Where there is no ChangeDate in the remote table we pull the whole table into a scratch-pad table, compare every single row and column, and update only those rows that are different (with [ChangedOn] set to current date/time. (***)

We can then further process that [into our own tables] filtered to "Only rows which are new/changed since last time"

So basically we have "Figure out what has changed between Remote DB and our staging tables" and then "Map that onto our APP DB's tables / columns". We mechanically generate all that code, including all validation logic (e.g. "remote column values have become longer than local column"). Sounds like, with hundreds of tables to do, it might be worth your while doing the same. We a template of how we would like the ideal code to be, and then the actual tables and lists of column-actions are just "injected" into that so whenever possible we make zero changed by hand - its a boring job, computer does it consistently every time, of course.

(**) possibly pertinent point. We take the view that the staging table should be an identical replica of the remote. We NEVER use some massaging of columns when pulling data from the remote. That means that when we find a bug in the upstream code we can fix that, rerun the UPDATE (from ChangeDate >= '19000101') and we don't have to pull ALL that data AGAIN from the remote. If we need an additional column then that is a big query from the remote.

(***) If the remote is on a slow connection, or pulling TB over the wire will annoy everyone else! we mount a duplicate set of Staging Tables close to the source DB. We freshen that up, with a full COMPARE EVERYTHING, only setting our staging table's ChangeDate column on rows that are materially changed (i.e. ignoring any columns in original tables that are not part of our process / staging table). We can then use that ChangeDate to ONLY move rows across the wire from remote staging tables to our local staging tables.

Kristen · January 19, 2017, 9:37am

Might be worth exploring whether somebody, not necessarily you, could put a trigger on the tables.

Passing you the whole row, when it changes, might be desirable but it would probably be fragile, so a better bet might just be to record Unique Keys and Change Date [i.e. for each, relevant, table] locally (e.g. in an adjacent DB on that server] and leave it at that.

Your "pull" of data could then use that ChangeDate info to pull only-modified/new-rows.

Anjitha · January 19, 2017, 11:11am

If a trigger is possible then I guess major part of the work would be reduced...
Gotta check if that might be possible

Anjitha · January 19, 2017, 11:24am

Thanks for the detailed reply and providing an insight into the problems and giving ideas about how to proceed...

The more we examine the remote DB , the more we wish to run away from this project
But unfortunately that's not an option

yosiasz · January 19, 2017, 4:08pm

How are rows inserted and deleted in these tables you are trying to keep track of? stored procedure? linq/entity framework?

DECLARE @ChangeTracker table( NewScrapReasonID smallint,  
                           Name varchar(50),  
                           ChangeType varchar(50),
                           ModifiedDate datetime);  
INSERT Production.ScrapReason  
    OUTPUT INSERTED.ScrapReasonID, INSERTED.Name, 'INSERT', INSERTED.ModifiedDate  
        INTO @MyTableVar  
VALUES (N'Operator error', GETDATE());

Then dump this data to your real table changes tracker

Insert into Audit_ChickenWings(allRequiredColumns, CHangeType)
select Col1, Col2, Col3, ChangeType 
  from @ChangeTracker

Then use this Audit tables that have datetime columns you use to key off. You process will have another table that keeps track of last synch with columns AuditTableName, LastSynchDateTime. On your next synch you will bring new rows where Audit_ChickenWings.ModifiedDate >= LastSynch.LastSynchDateTime and AuditTableName = 'Audit_ChickenWings'

We do this for updating our SOLR search database

Anjitha · January 19, 2017, 5:13pm

The tables to be tracked are part of an existing ERP system, which is why only read only access is provided. Insertions and updations might be done through stored procedures.
The existing system cannot be modified which is why this is turning out to be a nightmare

yosiasz · January 19, 2017, 5:43pm

that is like asking "Please prepare a 7 course meal. But we can only provide you wooden cooking utensils"

I say Option 1: Walk away!

Anjitha · January 20, 2017, 9:26am

Ifor · January 20, 2017, 2:12pm

I find it very difficult to believe that tables in an ERP system do not either have an actual or implied primary key. If the tables do not have a primary key I suspect they will either have an unique constraint or an unique index instead. The first thing I would do is to ask the client for all the DDL of the indexes on the tables you are interested in.

Some ERP systems have license restrictions on direct access to the DB. If this is the case, your client might be able to setup transactional replication or they may just have to send you regular BCP dumps of the tables.

Anjitha · January 20, 2017, 4:56pm

I know it sounds difficult to believe there are no primary keys, the more
unbelievable fact is whose ERP it is.

They have given linked server as an option today

Need to check about it...

Kristen · January 20, 2017, 5:05pm

I have a home-brew script for "looking at new databases" that does things like:

List of Tables, and Columns for each table, with Type, Length, etc.

For each Table : COUNT(*)

For each table.column name:

COUNT of (non-null) rows
Number of distinct values (unique if it matches COUNT(*), but matching COUNT [if less] is also helpful
Min/Max actual values
Min/Max actual length (for VARCHAR columns)

no doubt some other stuff too, long time since I've used it and about to dash off, but I could have a look next week if it is of interest

Anjitha · January 21, 2017, 12:17am

That would be great, hope it won't be too much trouble
Thanks Kristen