We build Intranets which pull data from numerous databases within a Client's company and then provide a consistent enquiry system for all users.
The way we "pull" that data from various 3rd party source databases is as follows:
We create a STAGING database which contains a staging table for each table (of interest) in the remote database. Each staging table can have fewer columns than the Source table (i.e. just the ones of interest), although we find that, over time, more data tends to be required; so if space is not a problem, and there are no wide columns that are not needed (e.g. "Notes") - which would take bandwidth to populate at each refresh - then we pull all columns.
We have situations where the license on the remote database does not allow direct connection (i.e. a license permitting direct query is expensive, and not purchased by the client), so in those instances the data is exported to file by BCP (make sure you use Native Format). We then import the BCP file into a temporary table whose structure exactly matches the BCP file, and freshen up the staging table from there.
Same would be true if the Source was not a SQL Database, but rather a CSV file or whatever.
We only update the staging table with changed rows. Sometimes the remote table has a ChangeDate column (Good!), so we only query newer-than-last-time rows. Make sure that the value you store for "last time" is in the relevant database - usually the STAGING database - so that if that database is restored to an earlier date, the subsequent "pull" of data restarts from the correct date/time.
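A minimal sketch of that newer-than-last-time pull; all table and column names here (PullWatermark, Temp_Customer, RemoteDB and so on) are illustrative, not from any real system:

```sql
DECLARE @LastPull datetime

-- The watermark lives in the STAGING database, so a restore of
-- that database rewinds the watermark with it
SELECT @LastPull = LastChangeDate
FROM   dbo.PullWatermark
WHERE  TableName = 'Customer'

-- Pull only rows changed since the last successful run
INSERT INTO dbo.Temp_Customer (CustomerID, Name, ChangeDate)
SELECT CustomerID, Name, ChangeDate
FROM   RemoteDB.dbo.Customer
WHERE  ChangeDate > @LastPull

-- Store the new high-water mark as the MAX(ChangeDate) actually
-- pulled (not GETDATE(), so remote clock skew cannot lose rows)
UPDATE dbo.PullWatermark
SET    LastChangeDate = (SELECT MAX(ChangeDate) FROM dbo.Temp_Customer)
WHERE  TableName = 'Customer'
  AND  EXISTS (SELECT * FROM dbo.Temp_Customer)
```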
Where there is no ChangeDate column (Bad!) we pull all rows into a temporary table.
We then compare every row and every column - but ONLY the columns of interest in the staging table - using a BINARY collation (to catch upper/lower-case differences) and freshen up our staging table for rows that have differences.
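The column-by-column compare can be done with a COLLATE clause on each comparison; a sketch with invented names, where T is the temporary table holding today's full pull and S is the staging table:

```sql
-- COLLATE Latin1_General_BIN catches changes that a case-insensitive
-- collation would treat as equal (e.g. 'SMITH' -> 'Smith')
UPDATE S
SET    S.Name           = T.Name
      ,S.Address        = T.Address
      ,S.Action         = 'I'         -- row exists on Source
      ,S.ChangeDateTime = GETDATE()
FROM   dbo.S_Customer AS S
       JOIN dbo.Temp_Customer AS T
         ON T.CustomerID = S.CustomerID
WHERE  S.Name    COLLATE Latin1_General_BIN <> T.Name    COLLATE Latin1_General_BIN
   OR  S.Address COLLATE Latin1_General_BIN <> T.Address COLLATE Latin1_General_BIN
-- NOTE: NULLable columns need an extra "one side IS NULL and the
-- other is not" test per column, omitted here for brevity
```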
Each staging table has a couple of additional columns:
Action = Inserted / Deleted - if the row no longer exists on the Source then set to Deleted, otherwise Inserted. (The same row may be deleted today and come back again tomorrow)
Change Date/Time - the date/time when the row was last changed (i.e. Staging Table Create/Update date/time)
Error Flag - set if there is a subsequent import error importing that row. That is so that the row will be re-attempted tomorrow, even if not changed (e.g. a foreign key error that prevents import of this row today, which may be solved tomorrow such that the row would then import).
IDENTITY - a simple, unique, ID that can be used in error reports etc. avoiding complexity of multi-part PKeys etc.
Staging tables have Unique Clustered Index matching Source Table. (Usually PKey)
Staging columns will generally be oversized - so perhaps all varchar columns will be varchar(8000) - so that there is no risk of a size-increase on the Source database going unnoticed, or causing a runtime error.
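Putting the extra columns, the oversizing, and the clustered key together, a staging table might be declared like this (names and the specific Action/flag encodings are illustrative):

```sql
CREATE TABLE dbo.S_Customer
(
     CustomerID     int NOT NULL                -- matches the Source PKey
    ,Name           varchar(8000) NULL          -- deliberately oversized
    ,Address        varchar(8000) NULL          -- deliberately oversized
    ,Action         char(1) NOT NULL            -- 'I'nserted / 'D'eleted on Source
    ,ChangeDateTime datetime NOT NULL           -- when the staging row was created/updated
    ,ErrorFlag      tinyint NOT NULL DEFAULT 0  -- set on import error, so the row retries tomorrow
    ,StagingID      int IDENTITY(1,1) NOT NULL  -- simple unique ID for error reports
)

-- Unique clustered index matching the Source table's PKey
CREATE UNIQUE CLUSTERED INDEX IX_S_Customer ON dbo.S_Customer (CustomerID)
```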
If we have a remote database with no ChangeDate column on the tables, and a large amount of data, we host a STAGING database "near" that source database, freshen up the rows (in the staging tables) as normal, and then create the same tables on our (local) STAGING database, and then pull changed data into that from the remote [staging database]. We do this where the bandwidth to pull "all rows, all tables" is a problem, so doing that locally, at the remote location, is fast and then just pulling the changed rows in the staging tables is quick.
We do all this because the key for us is diagnosing issues once the system is live. Client says "Such and such is wrong" and we can see exactly when the relevant rows were updated in the staging table and thus easily diagnose what should have happened, and what actually happened. More often than not we find the problem is in the source data, not our processes. But it's always "our fault" because the problem comes to light at our end ... <sigh>
We then have a series of UPDATE Sprocs that apply the staging data to our target database. These deal with any transformation of the data, including if source data needs to be combined from two/many tables into a single table or vice versa.
The Sprocs are built pulling data from a VIEW, where the VIEW is created to "represent" the Source data in the Target table column layout. We tend not to perform datatype transformations in the VIEW because we need to capture any errors, rather than have runtime errors.
So a view may look like this:
CREATE VIEW MyView
AS
SELECT [TargetCol1] = S.Col2
      ,[TargetCol2] = S.Col7
FROM   dbo.S_SomeStagingTable AS S
These VIEWs have proved very useful in debugging (rather than having the SProc just query the Staging Tables and do the transformation to Target Table columns). For example we can do

SELECT TOP 100 *
FROM   MyView
WHERE  SomePKey = 1234

and see the data presented with exactly the same column names / sequence as the target table. That makes it easy to see that we have, for example, misunderstood that column "XXX123" is not the customer number, but rather the address number ... very few of the 3rd party Source databases we pull data from have obvious Table and Column names
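An apply SProc built over such a view boils down to an UPDATE / INSERT / DELETE against the target. A hand-written sketch of the shape such generated code takes (table and key names invented):

```sql
-- Update rows that already exist in the Target
UPDATE D
SET    D.TargetCol1 = V.TargetCol1
      ,D.TargetCol2 = V.TargetCol2
FROM   dbo.TargetTable AS D
       JOIN dbo.MyView AS V
         ON V.SomePKey = D.SomePKey

-- Insert rows that are new
INSERT INTO dbo.TargetTable (SomePKey, TargetCol1, TargetCol2)
SELECT V.SomePKey, V.TargetCol1, V.TargetCol2
FROM   dbo.MyView AS V
WHERE  NOT EXISTS (SELECT * FROM dbo.TargetTable AS D
                   WHERE  D.SomePKey = V.SomePKey)

-- Delete stale rows (assuming the view excludes rows the
-- staging table has flagged as Deleted on the Source)
DELETE D
FROM   dbo.TargetTable AS D
WHERE  NOT EXISTS (SELECT * FROM dbo.MyView AS V
                   WHERE  V.SomePKey = D.SomePKey)
```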
All our "UPDATE Sprocs" are mechanically generated. We have a data table of the columns that are required, with some meta data, and basically an UPDATE (for existing rows) / INSERT (for new rows) and DELETE (for stale rows). There is a transformation that checks validity of the data - so if the Target column is varchar(30) but any row is longer than that, we associate an error with that row. The error is either critical, or not. (For a varchar() it is usually acceptable to record the warning message and truncate the data. But where the source data is, for example, a Customer ID and there is no matching ID in our customer table, then we record a critical error and that row is not imported.) There are thousands of potential errors (all mechanically generated). Every single varchar() column has a "length test", any "text date" has an "is this a valid date" test, and so on. There may also be some bespoke tests - for example in Europe we have to record the Sales Tax Registration number of the Customer in other European countries; that has a specific format and checksum, and we apply that to the imported data. (Yeah, we do find that top-end accounts packages allow any-old-rubbish in there and our validation is the first time that the Client discovers they are storing garbage - and are exposed to hefty fines by the Revenue ... but we also see State Names in the Country Field and junk like that, all completely unvalidated by the 3rd party database/application <Sigh2> )
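A mechanically generated "length test" for a varchar(30) target column might come out like this (the error-table columns, GUID value, and names are invented for illustration):

```sql
DECLARE @SourceTableID   int = 1
DECLARE @ImportProcessID int = 10
DECLARE @ErrGUID_NameTooLong uniqueidentifier
        = '00000000-0000-0000-0000-000000000001'

-- Record a non-critical (warning) error for any row whose Name
-- will not fit the Target's varchar(30) column
INSERT INTO dbo.DataValidationErrors
       (SourceTableID, ImportProcessID, ErrorGUID, PKeyDescription, Status, ChangeDateTime)
SELECT  @SourceTableID
       ,@ImportProcessID
       ,@ErrGUID_NameTooLong
       ,CONVERT(varchar(20), V.SomePKey)  -- human-readable PKey
       ,'A'                               -- Active
       ,GETDATE()
FROM    dbo.MyView AS V
WHERE   LEN(V.Name) > 30
-- Non-critical: the import still proceeds, with the data truncated.
-- A critical test (e.g. no matching Customer ID) would additionally
-- exclude the row from the INSERT/UPDATE into the Target table.
```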
So now the data has been freshened up in our database's Target tables, and any rows with errors are stored in our "Data Validation Errors Table". The User has tools to view those errors, and mostly they will then clean up the source data and tomorrow night's import will then remove that error (actually we change the error record from "Active" to "Inactive", again so we can report on "What errors changed, when", because if we just deleted it all knowledge of what the operator did would be gone ...)
Data Validation Errors Table has columns for:
Source Table ID (which also has an ID for the Source Project/Database)
Import process ID (e.g. the SProc that was doing the import)
Record ID (IDENTITY)
Error Number GUID
Primary Key - concatenation of all primary keys in "human readable form" - something good enough for the User to find the original record, and hopefully! to fix it.
Change Date/Time - last time the state of this error, for this record ID, changed
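That list gives a table along these lines (the datatypes and widths here are our guesses at a reasonable shape, not a prescription):

```sql
CREATE TABLE dbo.DataValidationErrors
(
     RecordID        int IDENTITY(1,1) NOT NULL PRIMARY KEY
    ,SourceTableID   int NOT NULL           -- also identifies the Source Project/Database
    ,ImportProcessID int NOT NULL           -- e.g. the SProc doing the import
    ,ErrorGUID       uniqueidentifier NOT NULL
    ,PKeyDescription varchar(250) NOT NULL  -- human-readable concatenated PKey
    ,Status          char(1) NOT NULL       -- 'A'ctive / 'I'nactive (never deleted)
    ,ChangeDateTime  datetime NOT NULL      -- last time this error's state changed
)
```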
The User has the ability to mark certain Error GUIDs as Ignore - so we can suppress errors that turn out to be annoying / irrelevant or things that we are "working on".
It's a lot to type, to describe it, but all the code is mechanically generated, so it's quick for us to develop. The key thing, for us, is that it never fails, with a runtime error, from one year to the next. We are always the first port-of-call when a problem arises, and it is almost never our fault, so we want to keep that cost to a minimum. The revenue may be nice, but the "drop everything, this is really important" interruption is most certainly not ...
When a 3rd party changes their database schema, and we are the last to know (despite all the Change Control processes that the client imposes on both us and the 3rd party) our "This field too wide" error message traps catch that, the user sees that error for the first time, someone contacts us, and we do a standard "WTF" about the Client's Change Control processes!! and then modify our schema to match etc. and send the client a hefty bill ...