ETL - Late-Arriving Facts

Hi there, typically with ETL processes, you think about early-arriving facts and late-arriving dimensions. And you deal with this through using inferred membership. However, does anyone have any thoughts on late-arriving facts? With facts, you're loading your staging fact table with any records that have come in after the last load date (e.g., select column,... from table where orderdate>@lastloaddate). The problem with this is, what happens to metrics that are late emerging? This happens routinely in our line of business. Buses offload their server files to a staging area and then are ingested every day. But if a bus has to be changed out for maintenance and doesn't make it to the WiFi network, it could be several days or longer before any records emerge. I've always handled this by having a much longer lookback, but was wondering if anyone had thoughts on how to maintain the actual staging table. For example, would you select from the last 45 days, then look for new records, and load those to staging each day as well? Any thoughts would be appreciated.

Thanks!

If a bus stopped for maintenance on 12.01.2021 Dec 1, 2021 and it connects to wifi on Jan 7th 2021, this date should be recorded as the ingest date. Different than the ride date.

Explain to me how facts are arriving before dimensions in cases of buses?

And just so that we are on the same ride, please post sample dimension and fact schema?

Yosiasz, no, that's not correct. No one in the public or the executive team cares that the system screwed up and didn't deliver the data correctly. They want to know when people actually rode the bus. There's no business value in the data getting lumped into an incorrect date.

My question isn't on schema...I'm very aware that solving SQL problems requires a schema display, but this isn't a SQL problem, it's a theoretical problem. I understand how to write the queries to get the data needed. My issue is just a philosophical question about what to do with staging when you're only going back to the last load date. Typically, I routinely scan for records over the last 45 days (typically, most buses will have reported in within 30-45 days) and bring newly emerging fact records in accordingly. We've just been having an internal discussion about the Kimball method and only staging new inserts from the last load date.

you didn't get what I said but no worries, we can tackle this on the theoretical side. what do you mean by load date?

Yosiasz...sorry, I guess I didn't understand the initial question. I don't think I was concerned about the relationship to the dimensions, but I think I see what you're getting at. So I'll try to address. First, there are prescribed dimensions that are measured every day (bus routes, individual one-way trips, vehicles, drivers, dates, route-patterns etc). The fact information varies but generally, the transactions we're looking for are accumulated distance, accumulated time, number of boardings, number of alightings etc. Drill down, let's say Route 100 was driven by Bus 400 last Saturday. Bus 400 uploads its server files via WiFi at 3am on Sunday morning when it gets back to the barn. The CEO, COO, everyone and their mother wants to know what happened on that bus ON Saturday. Now let's say when it got back to the yard on Sunday morning, the bus died and wasn't able to connect to WiFi. It sits in the shop for 6 days and finally comes into WiFi range when the bus is powered up again. Immediately, it uploads the files to the central server, files get ingested into the application database and now we have details for Saturday's trip data.

So for your question on load date, what I mean is, for the ETL process, presumably, you'd have a lineage process that tracks the last time the ETL package ran. Say the ETL runs tonight and the data for Bus 400 is now available in the application database. If we're constraining the query to only look at the last time an ETL package was run (yesterday for example), we're not going to see anything for Bus 400 because the service date was from last Saturday, which is before the last ETL load date. A sample query to fetch the records for staging might be:

DECLARE @LastLoadDate DATETIME
SELECT @LastLoadDate=MAX(LineageLoadDateTime)
FROM dim_Lineage

SELECT date_column, bus_column, some_metrics
FROM SourceTable
WHERE date_column>@LastLoadDate

Does that make more sense?

gotcha. service date was last Sat 1/1/2022 and load date 6 days later would be 1/7/2022, the load date/ETL process date should be stamped 1/7/2022 and you should be fine.

you are conflating service date (when bus was running around, picking people up etc) with ETL process date.

you feel me playa? :wink: and what the heck are you doing on a Saturday asking about Facts and dimensions. go enjoy weekend and come back Monday

LOL! Yeah...that's not going to happen. I'm like 5 weeks behind schedule. Actually, I do understand the difference between service date and ETL date. I guess what I'm getting at is generally, in the ETL Lifecycle, from what I understand, you're setting the WHERE clause for staging to only bring back records where your transaction date (in this case, service date) is greater than the last ETL date. This works nicely for ERP systems where you have manufacturing occurring real time, sales being logged real time etc. But when you have lagging information, it's a nightmare.

Our current ETL does a constant lookback of 45 days to find emerging records as noted before. But there are a couple folks on the ETL team that are pushing to only look for records that occurred after the last ETL Date. I wanted to get some 3rd party advice before coming back to the team.

Appreciate your time on this yo.

this happened to us in the manufacturing world. they got hacked and it took 6 months to get the data dump csv files. I did not change a thing cause my WHERE clause goes based on IngestedDate which is today. transactions happened 6 months+ ago. Files & transaction date (shipping, test bed dates) were 6 months old but IngestedDate was 1/8/2022

No need to go back 6 months

That makes sense, but go one step back before the fact. The WHERE clause between source data and the staging table. I assume you're appending your staging table without truncating it. So on the search for new records from the source, are you also saying you don't have a date range you're looking back to compare to see if old records came in?

Here's an example of something similar to what I'm doing now:

MERGE StagingTable AS T
USING SourceTable AS S
ON S.ID=T.ID AND S.ServiceDate>=DATEADD(DD, -45, GETDATE())
WHEN NOT MATCHED
THEN INSERT (Id, ServiceDate, OtherColumns...)
VALUES 	(S.Id, S.ServiceDate, S.OtherColumns....)
ETC;

This isn't exactly the method but demonstrates the basic idea. This is only to add to the staging table and again, there's no truncation in fact staging tables.

It depends on your design. What is your source data? Files or bus to a db directly?

The vendor's application pulls bus logs through a file manager, dumps them into their ETL load process. Then processed into their dim/fact tables. This is a closed/proprietary data warehouse that doesn't include anything but the bus information. So we have to use their data warehouse as a data source for the larger enterprise. So basically, I'm querying their fact & dim tables as a data source for this part of our business.

ok very good, finally confessing all the truth the whole truth. that makes a yuge difference if you have little to no control over it. in your case I would go back as far as you can.

LOL..ok, thanks for your input. It's what I thought from the start, but wanted to make sure I wasn't going crazy with the debates we've been having internally.

Thanks!

It was obvious it was an internal debate and you was looking for backup :rofl:

lol...I'd have been open to another solution. I just couldn't figure out what else would make sense. Thanks again for your help.