Best design for ingesting and correcting data from multiple 3rd party sources

rerichards · October 19, 2023, 12:43pm

We have a number of manufacturing plants in which operators manually enter test data which is saved into a third party software system (database). We then extract that test data, stage it, transform it, cleanse and validate, then load/ingest into a table for reporting. It is NOT a data warehouse. It is an OLTP schema.

Keep in mind this test data comes from different software, different databases systems, depending on the type of test. Also, if the manually entered data is bad, say it was fat fingered, it gets quarantined in our cleanse/validation process. However, we cannot catch all errors and that bad data gets ingested into our reporting table(s).

We are not able to make corrections on the source software database at the manufacturing plant and re-ingest. Therefore, once the test data has been extracted successfully from the manufacturing plant, we no longer care about that data at the manufacturing plant from a correction standpoint, only being able to relate (potentially) what we have in our system locally, compared to what was originally entered at the manufacturing plant for auditing purposes.

We are trying to design a long range solution for correcting bad data entry. Currently the two proposed solutions:

Upon extracting the raw data from the multiple manufacturing plants, store the raw/native, untransformed/uncleansed data in a central, permanent quasi, source of truth table. Then if data is found to be bad in our downstream cleansing and validation processes, we re-ingest by making corrections to the quasi source of truth table.
The second solution we are entertaining is NOT having any quasi source of truth centralized table, but if data has been discovered to be bad (whether caught and quarantined, or not caught and fully ingested), the corrections are made directly on the ingested OLTP reporting tables. That is, there are no upstream corrections (except for the quarantined records). All corrections are made on the ingested data directly. The thought being, as long as we can still relate the corrected record back to the true source record, meaning, the actual original record on the manufacturing plant software, this would be okay.

Does anybody see any gotchas in either scenario that might bite us in the behind later on down the road?