Correcting fact table records that were changed by a user

flamblaster · June 26, 2023, 4:50am

Hi there,

I've got a situation I want to try to model and wanted to see what other folks have done in this situation. I've run into this before but the data set I'm working with has a lot of user error issues that I'm going to have to deal with.

My understanding with Kimball method dimensional modeling is that you don't want to update fact records, which I have been able to avoid thus far. And what I'm working with has to be atomic grain, so periodic snapshots won't really make sense.

Basically, in the source system, users may need to correct a date field or other fields that are not additive. I'll be getting updates not based on order date but on insert or update date of the source system.

I'd like to know what other people are doing when they run into a similar issue. So for example, let's say (modeled below) a user originally put the order date (there are about 7 date fields they work with, but simplified to 1 here), in as 1/1/2020. The data gets loaded into the fact table that night, and all is good in the world.

The next day, a manager makes the employee correct the order date to 12/31/2019 because the employee failed to do the data entry. The measurements may or may not have changed. I'm focused on the date for this example.

Would you logically delete? Delete? Add a new record to reverse it? Or something else?

I've writen an extremey oversimplified example below. Forgive me for not normalizing the source data. Hopefully folks can see what I'm trying to point out.

Thanks in advance!

--load a date dimension with the 2 dates that will be in question
declare @dim_date table (dim_date_key int, cal_date datetime)
insert into @dim_date (dim_date_key, cal_date)
values
(20191231, '2019-12-31 00:00'),
(20200101, '2020-01-01 00:00')

--assume the fact staging table was loaded with source data
declare @stagetable table (order_id int, order_date datetime, amount float(2), insert_datetime datetime, update_datetime datetime)
insert into @stagetable (order_id, order_date, amount, insert_datetime, update_datetime)
values
(1, '2020-01-01', 15.35, '2020-01-01 15:30', null)

--check values of staging table
select *
from @stagetable

--set up simple fact table
declare @facttable table (dim_date_key int, amount float(2))

--insert fact records from staging that don't exist in the fact table
insert into @facttable (dim_date_key, amount)
select d.dim_date_key, s.amount
from @stagetable s
join @dim_date d on d.cal_date=s.order_date
except
select dim_date_key, amount
from @facttable

--check values of fact table
select *
from @facttable

--user fixed a date field in the source system. Consequently, there was a new version of the fact record
insert into @stagetable (order_id, order_date, amount, insert_datetime, update_datetime)
values
(1, '2019-12-31', 15.35, '2020-01-01 15:30', '2020-01-02 16:00')

--insert new records that are not already in the fact table
insert into @facttable (dim_date_key, amount)
select d.dim_date_key, s.amount
from @stagetable s
join @dim_date d on d.cal_date=s.order_date
except
select dim_date_key, amount
from @facttable

--what would you do with the original record? It's not a measure issue and not a late arriving dimension issue. The dimension was wrong.
--Do you logically delete the original? Do you throw it out then reload? Do you update it?
select *
from @facttable

RogierPronk · June 26, 2023, 4:08pm

What you are describing is called a slowly changing dimension and you have different kind of types.

Take a look at this video for an excellent overview:

SCD: Slowly changing dimensions explained with real examples

flamblaster · June 26, 2023, 4:27pm

Well, no, I have multiple SCD 1 & 2 dimensions modeled in various places throughout the DW. It's the fact that changed. In this example, the version of the fact record on that ETL package load was incorrect and needs to be changed. A slowly changing dimension is where the dimension changes. The dimension didn't change, the fact did.

RogierPronk · June 26, 2023, 8:19pm

You can use the same strategy. If you don't want to use it then I would choose to update the record. Maybe I would use a Temporal Tables - SQL Server | Microsoft Learn.

In my opion it all depends on the question if you need to explain why the data today is different then a week ago. If you have to answer that question I would use the technique SCD 2.

flamblaster · June 26, 2023, 10:32pm

Thanks Rogier, but updates to Fact records are typically not best practices. It would just be an overwrite (SCD1) but this isn't a dimension, it's a fact. There are multiple reasons that updating facts are frowned upon, but was looking for thoughts from people who are encountering this in a DW now. I could theoretically find the correction, identify the row, and do an in-place update of the value. But worried about transaction logging during ETL and other potential problems with updates.

I appreciate your insights, thanks

RogierPronk · June 27, 2023, 5:34pm

Maybe this article is a solution:

5) BI - Handle History on fact-tables LinkedIn