Merging 2 Databases..?

Morning all,

I have been asked to merge 2 databases. The 2 databases are located on 2 seperate servers, identical version of SQL and have an identical schema. Looking at it there are only a couple of tables that I'd need to copy over problem is they are relational.

i.e. 'employees' table has a primary key 'Emp_ID' which is linked to the 'notes' table so we know which notes are associated with which employee, hope that makes sense...

I've never done anything like this before, I'm not eve sure 'Merging' is the correct terminology. Any pointer and/or links to reading material would be greatly recieved.

Many thanks

Dave

this is gonna be fun, make sure to have a flask of whiskey under your desk
How much data do you have?

  1. backup and restore each databases to new server
  2. when restoring make sure to name each restored database distinctly. maybe prefix with source server name
    server01Quattro
    server02Quanttro
  3. Create a brand new database and name it Quanttro for example
  4. Create all the necessary tables in this new database Quanttro
  5. Then use either SSIS or basic vanilla tsql to ingest into new database Quanttro a sort of ETL, keeping the relationships intact

Presumably the ID numbers allocated to Emp_ID in each, of the two, databases "overlap" and "clash"?

Winding the clock back it would have been better that one started at 100,000 and the other DB at 200,000 - but assuming that did not happen:

You could just "add" 100,000 (or whatever is "enough" to avoid any clash) to the Emp_ID from the database being imported to make them unique.

You could have to also add 100,000 to the Emp_ID in the Notes table, and so on.

You could also, perhaps?, make all existing IDENTITY "negative" on import. Personally I find having negative ID numbers is not idea - sometime people fail to see the "-" when they look at them ... and of course if they ACTUALLY appear on any End User reports that doesn't look very good!!

After you've done the Import you need to change the "next available IDENTITY number" on that table to be after the now-largest value (maybe, instead of next-sequential-number, you would set it to 200,000 so that it is clearly "after the merge")

Its not quite that straight forward. You can do

INSERT INTO MyTable
SELECT *
FROM MyRemoteServer.MyRemoteDatabase.dbo.MyTable

to insert all the rows from Remote into Local. But that won't work if you have an IDENTITY column in the table, instead you have to do

SET IDENTITY_INSERT dbo.MyTable ON
INSERT INTO MyTable
(
   Complete list of ALL column names
)
SELECT *
FROM MyRemoteServer.MyRemoteDatabase.dbo.MyTable
SET IDENTITY_INSERT dbo.MyTable OFF

but that won't get you a change in your ID number, so you would have to do:

SET IDENTITY_INSERT dbo.MyTable ON
INSERT INTO MyTable
(
   MyID, Complete list of ALL other column names
)
SELECT MyID + 100000 AS MyID
       , Complete list of ALL other column names
FROM MyRemoteServer.MyRemoteDatabase.dbo.MyTable
SET IDENTITY_INSERT dbo.MyTable OFF

Other than that you just have to import the Employee table before the Notes table (so that the new Emp_ID exists before you try to insert the Notes rows).

Do you need to do this "again" or is it a one off?

Doing it "again", so that you accommodate "changes" is much more tricky. You have to delete anything no longer present in RemoteDatabase, add new rows, and update all the existing rows (or just the ones that have changed to speed up the process)

You need to do a Dress Rehearsal to prove that your code is correct, so you probably want to make a copy of your Target database (using BACKUP and RESTORE-to-new-TEMP-name), practice the IMPORT from RemoteDatabase until you have it correct (repeat the RESTORE before starting each new import-test run)

If performance, when referencing the database direct (using FROM MyRemoteServer.MyRemoteDatabase.dbo.MyTable) is bad, then take a BACKUP of that database, ON that server, COPY the backup file to Target server, and RESTORE to TEMP database on Target Server - and then you just need to use

FROM MyTempCopyDatabase.dbo.MyTable

instead

Once you create the empty database I also recommend you do this to manage the ingestion since as Kristen indicated there could be duplication of identity column values

  1. when creating all of the tables in the new database I recommend the following
create table employeesStaging(employeeid IDENTITY(1,1) NOT NULL, **Emp_ID** int not null, sourceserver varchar(50),  ...

So you first stage things in employeesStaging. When ingesting the employees data from both server01Quattro, server02Quanttro you preserve the Emp_ID values that you can use as reference to the respective sourceserver. The sourceserver is there for breadcrumb purposes and in case there are, which I assume there will be, idential Emp_ID across both servers.
Also you will have to make sure you do not create duplicate employees. There should be something that uniquely identifies them such as socsec# or some other unique value that should not be across servers.
So you do the same thing for other tables by staging them first and populate it you final tables

I would keep the original ID numbers from one of the databases and only allocate new ID numbers (or "adjusted ID numbers") to the other (rather than, for example, creating new IDs across the board)

Keeping the original table would mean not adding any new columns for "Original ID number", but you might choose to add a column for "original ID number" during the Import so that you can easily convert, for example, the Emp_ID in the Notes table from "Old" to "New" whilst importing that.

But if you can just add 100,000 to make the "New ID" then I think that is easier [than having to have Old / New conversion columns]

I think a few basic questions need to be asked before doing anything:

  1. Are these two DBs two HR type DBs or are they DBs which have imported employee information from a single HR DB?
    (ie Does a given emp_id represent the same person in both DBs?)

  2. If they are two HR type DBs is there a clash of emp_id?

  3. If they are two HR type DBS, is emp_id just an internal number or is it reported externally?
    (eg To the relevant tax authorities.)

etc.

Morning Guys,

Thanks for all your replies & apologies for my late reply, I've been out unexpectedly of the office for a few days.

@Kristen: Correct, the ID's overlap & clash between the databases. I quite agree it would have made a lot of sense to start the ID's a fair way apart...

@Ifor: No, both databases, although with an identical schema, have grown up independently and contain different data. ie ID 1508 in one DB might reprisent 'Bob Smith' while ID 1508 in the other might represent 'Alexia Brown'...
No the ID number is not reported externally, DB & internal use only....

I must admit I love the idea of adding, say, 100000 to each ID. In my mind it makes perfect sense...

I've going to flash up 2 new instances on my test box and import the backups see if I cant get this sorted out....

Thank you!!

Dave

@ Kristen:

I managed to get some time today and did some testing on the test server, your script works perfectly, thank you!! Only issue i'm having is I can't work out how to correctly reference one sever from the other:

FROM MyRemoteServer.MyRemoteDatabase.dbo.MyTable

i've tried using netbios name & ip address neither seemed to work

FROM testsqlsrv.TestDatabase.dbo.TestTable
FROM 192.168.0.2.TestDatabase.dbo.TestTable

Do I need to reference the 'other' server somewhere in management studio..?

Many thanks

Dave

Yes, you need to set up a Linked Server, give that [linkage-object] a "Friendly Name" - which would be "MyRemoteServer" in my example - and then you can do the 4-part naming.

I strongly recommend that you use SYNONYMS for 3 and 4 part naming - then if the remote server name, or the remote database, changes (e.g. from TEST to LIVE) then you can just change the SYNONYM instead of all your code!! It would help if you had all the synonym table DROP / CREATE in a single script, and then you can just FindAndReplace "MyTestRemoteServer" with "MyLiveRemoteServer" and re-run the script (or run the script on your local Test database one-way, and on your local Live database another-way)

Don't JOIN a RemoteServer.RemoteDatabase.dbo.RemoteTable if you can help it (SQL is liable to pull all the data local and then figure out the JOIN at that point throwing away all the irrelevant data ... which has used lots of bandwidth to "pull"). If you need to do that consider using OPENQUERY instead (most particularly if you have JOINs from one REMOTE table to another table ont hat same REMOTE server) , or pull the relevant rows into a #TEMP table locally, create a suitable index to facilitate any JOINs, and then JOIN the #TEMP to your other local tables.

FROM [192.168.0.2].TestDatabase.dbo.TestTable

might work, but personally I would want to create a LinkedServer so that "192.168.0.2" had a more friendly / more obvious! name.

Linked server worked perfectly, it was a bit of a swine to setup but once set up it was fine. Databases have now been merged. Something else I know how to do now :smiley:

Thank you so much...!

Dave

Yeah, that's normal IME ...

... I have a script with "all of the things I've ever tried" in it as "snippets" which I find speeds up the process ... but every time I have to connect to Oracle, MySQL, PostgreSQL or ... worse .. "Something New" then it definitely takes a while of trial-and-mostly-error ...