Design ?: gather data from multiple MSSQL databases

MSSQLN00b2 · August 5, 2015, 2:44am

I've been asked to assist a friend in producing some combined reports from several of his stores. The software he uses is a client/server .net winforms application connecting to an MSSQL db. There is one server per store but they are connected via vpn and the vendor has given him access to a readonly user to produce his reports.

We are using MSSQL Express with db links to the stores' databases. I am thinking that we will need a Store table to identify the source of the data being pulled back. Then I will populate the customer, product, manufacturer data while maintaining the store id. The problem I see is tying the customer data together to act as one customer; this will require using a table to identify the like customers and their external id.

I know MSSQL Express doesn't have a job agent so I will either schedule task manager to command line sql scripts or write a windows service. Any ideas are greatly appreciated.

Kristen · August 5, 2015, 7:09am

We use Staging Tables for this type of activity, but it depends on your bandwidth and the size of the database. Obviously if the database is small enough you can just pull the data over the VPN; our databases aren't! I see lots of places doing "TRUNCATE LocalMasterTable; INSERT INTO LocalMasterTable SELECT * FROM RemoteLinkedServer.RemoteDatabaseName.dbo.RemoteTable" which is incredibly slow and scales really badly of course.

We create an XXX_XFER database, on the remote machine, and include in it a copy of all the tables that we want to pull data from. We add two columns (to the start of the list, easy them to add more columns to the end if the APP changes) for Action (1=Created, 2=Deleted) and a ModifiedDate. We then update this table ONLY with rows that have changed. If the APP has a ModifiedDate then use that as part of the selection (but we find that APP modify dates are rarely 100% accurate) and then a massive WHERE clause checking every column for differences (ask me if you need an example, we have things in ours that treat changes to Case and Trailing spaces as a difference, as well as handling NULL-in-one-but-not-the-other).

We ONLY set our ModifiedDate if a column that WE are interested in changes. Thus a column such as "ViewedByUserID" might change frequently, but that might not actually be of interest to us, no point "pulling" that record back to base until an important column changes.

OK, so now we have some Staging tables with all the latest data, from the APP database, in it. We now pull changed rows back to our central database - again, an XXX_XFER database with identical columns. We can find the MAX(ModifiedDate) in our central table and pull any/all rows with a newer ModifiedDate back and update our central table.

We then, again using the XFER tables ModifiedDate and a stored "LastRunOn" date, process data into our central APP. In your case you don't really have a central APP, so the central XFER tables may well be good enough (i.e. as a Warehouse Database). (If you do want to combine all the store data into a single set of tables, with a StoreID in each row, then you could do that at this point, but it would be a selective UPDATE/INSERT/DELETE of only rows changed since last time - using a simple WHERE clause on the ModifyDate column in the staging tables)

Alternatively you could have one XFER database per store, rather than having a central master database with a "StoreID" in each row. Then a VIEW that uses UNION ALL to combine all the tables in all the warehouse databases - this type of Partitioning is very efficient in SQL, given the right unique keys and appropriate indexes, for example if you query StoreID=1 then SQL will ONLY interrogate the tables in the view that represent that Store. Downside is that you have to create a new database, and modify the View, every time a new store is added. If that is not very common then that might be the way to go, if they are opening a new store every week then it would be cost effective to automate that process

Am I understanding this correctly? You have a Customer, John Doe, who buys at Store=1 and Store=2, but he has CustomerID=1234 in Store=1 and CustomerID=5678 in Store=2 - and you would like to be able to aggregate those together?

Nightmare! Give all the customers Loyalty Cards and use the ID from that to aggregate them?

Address + Telephone etc. can help, but IME the variations in typing in the same address are huge - USA addresses look pretty simple, compared to rest of world, and also if phone numbers are only one country then they might be uniform - ours in the UK rarely are when typed in by different operators ... but converting them to "digits only" format might give you some matches.

We have found situations where multiple companies operate through an operating agent, so the same operating agent's phone number appears on multiple Customer accounts, for example. That can apply to address too - e.g. where a forward agent is used,

You will need a StoreID in every row (in customer, product, manufacturer data tables) so you can join on OriginalTableID + StoreID) hence why I would favour having separate tables (or separate databases) for each store and a VIEW that combines them:

CREATE VIEW MasterCustomer
AS
SELECT StoreID = 1,
       CustomerID,
       CustomerName,
       ...
FROM StoreDB1.dbo.Customer
UNION ALL
SELECT StoreID = 2,
       CustomerID,
       CustomerName,
       ...
FROM StoreDB2.dbo.Customer
UNION ALL
       ...

you can even include some Joined columns in there if you need to

CREATE VIEW MasterOrder
AS
SELECT StoreID = 1,
       O.OrderID,
       O.OrderDate,
       ...
       C.CustomerID,
       C.CustomerName,
       ...
       CCC.CustomerCommonNumber
FROM StoreDB1.dbo.Order AS O
     JOIN StoreDB1.dbo.Customer AS C
          ON O.CustID = C.CustID
     JOIN LocalMasterDB.dbo.CustomerCommonCodes AS CCC
          ON CCC.StoreID = 1
         AND CCC.CustID = O.CustID
UNION ALL
SELECT StoreID = 2,
       ...

If the individual store databases are reasonable modest in size you might be better off pulling a Full Backup File back to the central office and restoring it (each Store to a different restored database name) and then going from there with the UNION ALL View. Pulling individual changed rows (i.e. using staging tables as I described above) is quite a big programming job - you can throw something together easily enough, but IME that then proves to be fragile, breaks often enough to be an annoyance both to users and support staff!, and scales badly so becomes a performance bottleneck. A compressed backup may be very quick to "pull", and hopefully?!! the Vendors APP is already generating that on each store's server.

MSSQLN00b2 · August 6, 2015, 12:34am

Wow Kristen,
What a fantastic response! Well thought out with great ideas based on experience not theory. Thank you so much, this has given me some direction that I use to look at the problem again and decide the best approach. I'll let you know how it turns out. Very glad I found this forum! Cheers!