MERGE dim table with no source PK or artificial id of any kind

flamblaster · July 9, 2017, 6:09pm

Hi, I’m working on an ETL process to pull data from a production database. Some of the source tables I’m using to populate dimension tables in a data warehouse don’t have primary keys (existing purchased product). I’d like to create a merge statement to insert/update/delete based on changes, but I’m a bit perplexed on how to approach this. I have two basic ideas, in idea 1), I’d just set it up as a normal merge, but the ON clause would essentially have just about every column in it, which would guarantee that any changes would be inserts, not updates. In idea 2) I would create a middle table to archive all the source data and add a primary key. Here are a couple table variables and process to illustrate the issue:

--Setup the sample source data
DECLARE @FeedbackCategorySource TABLE (FeedbackCategory VARCHAR(30), ReportGroup VARCHAR(30), FeedbackSubCategory VARCHAR(30))
INSERT INTO @FeedbackCategorySource (FeedbackCategory, ReportGroup, FeedbackSubCategory)
VALUES
 ('Complaint', 'Civil Rights', 'ADA')
,('Complaint', 'Civil Rights', 'Title VI')
,('Complaint', 'Civil Rights', 'General Discrimination')
,('Complaint', 'Daily Operations', 'Employee')
,('Complaint', 'Daily Operations', 'Vehicle')
,('Complaint', 'Daily Operations', 'Facility')
,('Complaint', 'Daily Operations', 'Bus Schedule')


--mock DIM table
DECLARE @dim_FeedbackCategory TABLE (dim_FeedbackCategoryKey INT PRIMARY KEY IDENTITY NOT NULL, FeedbackCategory VARCHAR(30), ReportGroup VARCHAR(30), FeedbackSubCategory VARCHAR(30),
									 update_time DATETIME)

--MERGE with sort of useless update??  Delete and insert seem to be the only value
MERGE @dim_FeedbackCategory TARGET
USING @FeedbackCategorySource SOURCE
ON target.FeedbackCategory=source.FeedbackCategory AND target.ReportGroup=source.ReportGroup AND target.FeedbackSubCategory=source.FeedbackSubCategory
WHEN MATCHED  --This seems useless since we've already joined on all the columns in the SET clause
THEN UPDATE
SET target.FeedbackCategory=source.FeedbackCategory, target.ReportGroup=source.ReportGroup, target.FeedbackSubCategory=source.FeedbackSubCategory, target.update_time=GETDATE()
WHEN NOT MATCHED BY TARGET
THEN INSERT (FeedbackCategory, ReportGroup, FeedbackSubCategory, update_time)
VALUES (source.FeedbackCategory, source.ReportGroup, source.FeedbackSubCategory, GETDATE())
WHEN NOT MATCHED BY SOURCE
THEN DELETE;

Anyone have any thoughts on which method to go with or any tips on how to make method 1 work?

Kristen · July 10, 2017, 6:44am

That's pretty much what we do. Some 3rd party stuff has a reliable ChangeDate column on each row, others have nothing.

Either way, we rarely want every single column, so we pull rows (with newer ChangedDate if available) and compare every column for differences and only "import" the rows that are new / changed [in columns of interest to us]. We Delete+Insert those into a Staging Table, with an additional ChangedOn column, which we can then use - reliably! - downstream for merging into our Production DB. If the source DB is huge, and remote, we site a Staging DB near that database, and then use our ChangedOn column to pull only changed rows to our local Staging Table [to keep WAN traffic to a minimum] and then merge into Production DB as normal.

Sooner or later we get a question about data which is easily resolved by knowing the date / time that we last "acquired" a row, and without our staging table and ChangedOn column we would spend a lot of time diagnosing that.

Dunno if this is the most efficient way of comparing columns, but its what we use it caters for trailing spaces and case-sensitive comparison of strings, and NULL on either Source / Destination. The code is all mechanically-generated

UPDATE D
SET
	[MyIntCol] = S.[MyIntCol]
	, [MyStrCol] = S.[MyStrCol]
-- SELECT TOP 100 'D', D.*, CHAR(13)+CHAR(10)+'S   ', S.*, CHAR(13)+CHAR(10)+'----------'
FROM	DestinationDB.dbo.[MyTable] AS D
	JOIN SourceDB.dbo.[MyTable] AS S
		 ON D.[MyKeyCol] = S.[MyKeyCol]
WHERE
	1=0
	OR (D.[MyIntCol] <> S.[MyIntCol] 
		OR (D.[MyIntCol] IS NULL AND S.[MyIntCol] IS NOT NULL) 
		OR (D.[MyIntCol] IS NOT NULL AND S.[MyIntCol] IS NULL))
	OR (DATALENGTH(D.[MyStrCol]) <> DATALENGTH(S.[MyStrCol]) 
		OR D.[MyStrCol] <> S.[MyStrCol] COLLATE Latin1_General_BIN2
		OR (D.[MyStrCol] IS NULL AND S.[MyStrCol] IS NOT NULL) 
		OR (D.[MyStrCol] IS NOT NULL AND S.[MyStrCol] IS NULL))

flamblaster · July 10, 2017, 6:23pm

Hey Kristin, thanks for responding! So if I understand correctly, you create a middle table (staging) between the source and final destination. In that staging table, you're auto-generating an artificial primary key to identify the record. Then you're using sort of a status (via date imported) to determine if it should go into the final table? I was hoping to avoid this because I just feel like once it's inserted to the staging table with an artificial key, if the source table changes, there's nothing left to recognize the original record, but I have some ideas now. Thanks!

Kristen · July 11, 2017, 6:30am

Yes, as you described, except I've never had to tackle a table without a PKey (well, probably have, but for sure it would have had a set of columns that was unique and, essentially, non-changing)

The Staging table always seems like a giant waste, but in practice we get asked half a dozen questions a year that it answers (usually relating to solving When? a particular record changed). We also have a FLAG on the Staging row that indicates 1) New, 0) Deleted [from remote] and 2) Error during import (to Production).

(2) gets retried (same as a row with newer ChangedDate would). This sometimes happens because a Parent record fails some validation test (FKey perhaps ...), tomorrow the Parent Record has been fixed, on Remote, so imports OK ... but the Child row has not changed, on remote, so would never import again. Hence we set it to (2) so it re-tries every batch until successful.

You might also want to add an ErrorMsg column, to store a "Why it failed" message. We have a separate table for that, and it allows end users to sort out any problems with imported rows, without using up IT time. Its a while ago now since I've done that sort of consultancy, but it seemed to me, at that time, that IT departments spent all morning figuring out why SSIS (or DTS as it was then) had failed, whereas for me the user does that for themselves by reporting on all rows with (new) error messages.