SCD 1 & 2 for dim_employee

Hey there, I've been struggling a bit with my dim_employee ETL design. Basically, the source table from the HR system has from/to dates already. When I used the start/end date logic for the SCD transformation wizard, I started having a lot of conflicts because I need to track history on quite a few columns, including the dates. Anyway, what I'm playing with now and think will work is to stage the data first to look for changes on non-current flag'd records and union it to a check to see if there are new records. I've just now come up with this idea after a couple of days of reading/searching and wanted to see if anyone had any thoughts/review notes. Without constraining to only new records in the transformation, every execution adds records irrespective of whether there are changes (I've mapped data types a dozen or so times and actually created the dim table as a replica of the origin table so I'm pretty sure it's not a stray data type). Any thoughts would be much appreciated!

SELECT e.employeeid, e.lastname, e.firstname, e.badgenum, e.status_type, e.employee_type, e.division, e.fromdate, e.todate
FROM employeesource e
JOIN dbo.dim_employee X ON X.employeeid=e.employeeid AND X.valid_from_date=E.fromdate AND X.current_flag='Y'
WHERE EXISTS (SELECT X.employeeid, X.lastname, X.firstname, X.badgenum, X.status_type, X.employee_type, X.division, X.valid_to_date
EXCEPT
SELECT e.employeeid, e.lastname, e.firstname, e.badgenum, e.status_type, e.employee_type, e.division, e.todate
)
UNION ALL

SELECT e.employeeid, e.lastname, e.firstname, e.badgenum, e.status_type, e.employee_type, e.division, e.fromdate, e.todate
FROM employeesource e
WHERE NOT EXISTS
(SELECT 1
FROM dbo.dim_employee X
WHERE X.employeeid=e.employeeid AND X.valid_from_date=E.fromdate --AND X.current_flag='Y'

		)

Please post schema for dim_employee?

Also please identify the columns causing conflicts? It will help us understand your logic

Sure,

CREATE TABLE [dbo].[employeesource](
	[employeeid] [int] NOT NULL,
	[lastname] [varchar](50) NULL,
	[firstname] [varchar](50) NULL,
	[badgenum] [int] NULL,
	[status_type] [varchar](30) NULL,
	[employee_type] [varchar](30) NULL,
	[division] [varchar](30) NULL,
	[from_date] [date] NOT NULL,
	[to_date] [date] NULL,
    PRIMARY KEY (employeeid, from_date)
)
GO

CREATE TABLE [dbo].[dim_employee](
	[dim_employee_key] [int] PRIMARY KEY IDENTITY(1,1) NOT NULL,
	[employeeid] [int] NOT NULL,
	[lastname] [varchar](50) NULL,
	[firstname] [varchar](50) NULL,
	[badgenum] [int] NULL,
	[status_type] [varchar](30) NULL,
	[employee_type] [varchar](30) NULL,
	[division] [varchar](30) NULL,
	[valid_from_Date] [date] NULL,
	[valid_to_date] [date] NULL,
	[current_flag] [char](1) NULL
)

Type1: to_date
Type2: firstname, lastname, badgenum, status_type, employee_type, division

Typically, from the source application, we'll get an end-dated employee record and a new current record in the same day.

I can do this pretty easily with a T-SQL Merge, but am trying to use the tools as-is. I can add an ole db command on the history pipeline to update the to date with the to-date from the source. I know that you typically should start/end date based on the ETL processing date for validity, but have two sets of start/end dates is proving to be a bit of a pickle.

Thanks for your consideration.

I think you should not use the records from the HR system. The fromdate and todate do not have the same meaning as the meaning of your dim. You should have 4 dates in your DIM.

If the have the same meaning then ignore the HR system :slight_smile:

Interesting perspective...I've read the opposite in Kimball's writings. The thing is, the from date/to date from the HR system is the actual "truth" of what occurred. The ETL is only stamping when the changes came into the data warehouse. In any event, thanks for your reply.

When you provide example data when the conflicts happens we can take a further look.

I get the need to provide sample data, I deal with this a lot myself. I'm asking more of a theoretical question. I'm not just trying to get past a particular problem, but rather, the core logic. Let me approach a different way. The source application's from/to dates are the actual effective dates for the employee. For example, an employee's status record might say he is a driver, who is active, in transit department from Jan 1, 2020 to June 1, 2020. Then on June 1, switches to paratransit department. Any combination of those status attributes can be changed independently, therefore, the HR system tracks the effective from/to date. I'm trying to figure out what to do with the original from/to dates...whether to just ignore them, as you suggested, or use them somehow. They are the source information. The ETL effective date is just saying when a key was effective. Anyway, I'll figure it out, thanks anyway. I'm probably looking more for a theoretical/philosophical answer than a technical one. I get the technical issues. I need to be able to tell staff that the dates are correct because they may not jibe with the source systems since the effective dates are derived during ETL.

Andddd...I think the solution is...it's really just Type1 SCD! There is already a history being logged. As new rows come in, they're already supplying the history, so no need to Output $Action a new record. Way simpler than I was thinking.

Thanks

What are the business requirements is really the main question I go by.
For example: badge #. Can it change if so do you have a business requirement to track this change ?
Jan 2021 to 2022 ot was x but 2022 onward it is y. Etc

@yosiasz, thanks for replying. All of the fields are Type 2 SCD. But as I mentioned, there's already history tracking in the source system which throws a wrench in trying to capture SCD 2. I can accomplish this through T-SQL Merge pretty effectively just as a Type 1 with some conditional logic based on whether or not the row(s) affected is/are current.