SQLTeam.com | Weblogs | Forums

Data Cleansing / Exception Reports


#1

I want to review processes we use for making sure that the data in the database is as clean as possible.

Our process is mostly:

Front end application has data rules. As the operator enters data they get a message explaining the problem ("That's not a valid date"), and they can fix the data there-and-then. Important for me is that the messages are self explanatory and that (for example) the application doesn't just terminate with a fatal, or even cryptic!, error.

Individual records (e.g. a single form completed by Operator) are saved through an SProc, and that performs further validation. For example, "Client Code not found". We have an OUTPUT parameter in the Sproc which allows a "validation error" to be returned to the user - i.e. "Record not saved" and "Here's a helpful explanatory message".

The Sproc only saves single rows, so is a performance disaster if we use it for bulk import, so we tend to "assume" that bulk import data is clean. Our import routine allows us to have conditional error messages, so we can check the import data for "Client Code not found" and attach an error message to all rows that fail the test(s) (and if the error is critical the row is not imported). The Operator can review the error messages, clean up the source data, and then the next import will succeed.

We do have FKeys and Constraints, but I've never found them useful other than as a last-line-of-defence. Catching an FKey error, and as a result then displaying a meaningful message to the user, has never seemed possible to me. Do you have a way of doing that?

That leaves other junk :frowning:

We have used two approachs:

Exception Report - apply every condition imaginable to find goofy data. e.g. in the Timesheet table "You can't have an Activity Code that implies Non Rechargeable Time on a Matter Reference (you have to use an Overhead Reference instead)"

Trouble with that is defining which Activity Codes are Non Rechargeable, and what is a Matter Ref and what is an Overhead Ref. This is for data imported from Some Other Database, which doesn't care about such things, so no specific attributes that I can use [in the Other Database]

So I'm now thinking that I need a table of some sort of Rules defining these things, so that a User Admin can adjust the rules, over time.

Its a very ambiguous thing though, so not sure how well I can build such user-configurable rules?

Data Validation Errors Table - a generic table which allows one/many error messages to be recorded against a Table + PKey. Then an exceptions Sproc, for each table needing validation, which stores error messages in Data Validation Errors Table (and removes previous errors that no longer fail).

The Exceptions Sproc allows a specific PKey (@Parameter) to be specified, or "All rows in table". Thus an APP that displays a Record Card for a specific row can call the Sproc to freshen-up any validation errors for that row, just before displaying the record card. Separately on a schedule, e.g. over-night, all rows are validated.

Operator can check for "Any new Data Validation Errors since last time" daily.

Are there different / better ways? Things that have worked well for you?


#2

Have you looked at Whats new in MDS, Why we need it and MDQS licensed with SQL Server? They do what you need and enable data stewardship by business data workers. this frees up the DBAs for other pressing tasks and fosters cooperation and coordination across IT and embedded data workers.


Common data in an enterprise
#3

Thanks, I'm off to take a look.

I've end to avoid Pinar's site. Articles are typically superficial, and I've seen plenty of comments pointing out errors in the text which have gone uncorrected. He has good Google ranking, so typically features in my searches, but I skip results from that site ...

I like that term, going to adopt that. I've never liked our "Data Tsar" term ...