Validation Toolkit

Hi everyone

I'd like to introduce my open-source program for data validation. It lets you define validation rules and run them against a SQL Server database. Feel free to try it out at SqlServerValidationToolkit on GitHub.

Kind regards

Simon

Question: in what circumstances do you envisage someone like me would use your package rather than just putting a CHECK CONSTRAINT on the underlying table? A check constraint prevents goofy values getting into the database from any source, rather than reporting on them after the fact, which IME means that users rarely get around to fixing them in a timely manner; meantime, crucial business decisions may have been taken based on the goofy data :frowning:
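For reference, this is the sort of thing I mean (table, column, and rule invented purely for illustration):

```sql
-- A check constraint rejects goofy values at the door, from any source.
-- Table, column, and the rule itself are hypothetical examples.
ALTER TABLE dbo.Orders
ADD CONSTRAINT CK_Orders_Quantity CHECK (Quantity > 0);
```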

Hi Kristen

I agree with you that in the normal case the data should be protected either by check constraints or by the validation logic of the application. However, I see three situations where the tool can be used:

The first is when the application really doesn't validate the data. This was the initial reason I wrote the tool: it was for a newborn clinic in a hospital where the doctors entered the lengths of newborn babies. Because the application was built on legacy technologies, the length wasn't validated on entry, and no one seemed to be around to fix it. The staff tried to correct the invalid lengths, but some invalid values were missed, so they had to be filtered out for every statistic.
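A soft rule for that case can be as simple as a query that flags implausible values without blocking data entry. For example (the table, column, and 30-60 cm range are my own illustrative assumptions, not the clinic's actual rule):

```sql
-- Hypothetical soft validation: list newborn lengths that look implausible
-- instead of rejecting them at entry time.
SELECT NewbornId, LengthCm
FROM dbo.Newborn
WHERE LengthCm IS NULL
   OR LengthCm NOT BETWEEN 30 AND 60;
```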

The second is when the validation logic changes, or when you are merging databases with different validation constraints, for example to load them into a data warehouse. In that case it can be helpful to collect the "invalid" entries at the source.

The third is when you receive a database but don't know what the data inside looks like. In that case you can quickly define validation rules that flag uncommon values, and then investigate the entries that match them.
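An exploratory rule of that kind might look for statistical outliers. Here is one rough way to do it (hypothetical table and column; the three-standard-deviation cutoff is an arbitrary starting point, not a toolkit default):

```sql
-- Hypothetical exploratory rule: flag values far from the column mean.
WITH Stats AS (
    SELECT AVG(Amount) AS MeanAmount, STDEV(Amount) AS StdAmount
    FROM dbo.Orders
)
SELECT o.OrderId, o.Amount
FROM dbo.Orders AS o
CROSS JOIN Stats AS s
WHERE ABS(o.Amount - s.MeanAmount) > 3 * s.StdAmount;
```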

Kind regards

Simon

Scenario one is easily (and better, IMHO) accomplished with check constraints as Kristen suggests.

Scenario two is probably better done with SSIS using lookup transformations and conditional splits to weed out invalid data. Also don't forget Data Quality Services and the related SSIS transforms.

Scenario three is very easily done with Data Profiler. Great tool to spot outliers and anomalies, among many other things.

Check constraints are good when you know for certain that a value is wrong. But sometimes a value is not "wrong", just suspicious. For those cases, a softer approach is better in my opinion.

You can achieve that soft approach with SSIS or with my toolkit. With SSIS you can filter the invalid data, but you have to create a conditional split and handle the data flow for invalid rows for every table. I started like this too, but I wasn't happy that the whole validation logic was scattered across the SSIS package, and I thought it would be simpler to define a set of rules and keep the execution logic out of the package. That way I don't have to change the package each time my validation logic changes.
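To make the idea concrete, here is a minimal sketch of rule-driven validation: one table holds the rules, and a generic loop runs them and records the failing rows. The schema and names here are purely illustrative, not the toolkit's actual tables.

```sql
-- Illustrative sketch of centralized, rule-driven validation
-- (not the toolkit's real schema).
CREATE TABLE dbo.Validation_Rule (
    RuleId      int IDENTITY PRIMARY KEY,
    RuleName    nvarchar(100) NOT NULL,
    FilterQuery nvarchar(max) NOT NULL  -- returns the Ids of violating rows
);

CREATE TABLE dbo.Validation_Failure (
    RuleId int NOT NULL,
    RowId  int NOT NULL
);

INSERT INTO dbo.Validation_Rule (RuleName, FilterQuery)
VALUES ('Newborn length out of range',
        'SELECT NewbornId FROM dbo.Newborn WHERE LengthCm NOT BETWEEN 30 AND 60');

-- Generic executor: run every rule and record the failing row Ids.
DECLARE @ruleId int, @filter nvarchar(max), @sql nvarchar(max);
DECLARE rule_cursor CURSOR FOR
    SELECT RuleId, FilterQuery FROM dbo.Validation_Rule;
OPEN rule_cursor;
FETCH NEXT FROM rule_cursor INTO @ruleId, @filter;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'SELECT ' + CAST(@ruleId AS nvarchar(10))
             + N' AS RuleId, v.* FROM (' + @filter + N') AS v';
    INSERT INTO dbo.Validation_Failure (RuleId, RowId)
    EXEC (@sql);
    FETCH NEXT FROM rule_cursor INTO @ruleId, @filter;
END
CLOSE rule_cursor;
DEALLOCATE rule_cursor;
```

The point is that adding or changing a rule is now just an INSERT or UPDATE on the rule table; the package that triggers the run never has to change.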

In the newest release the toolkit integrates better with an SSIS package. The validation can be started automatically via a command-line parameter, and the results can be stored in a SQL Server database. After the validation finishes, the table "Validation_WrongValue" contains the Ids of all invalid entries, so you can filter them out in your ETL process.
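For example, an extraction query in the ETL process could exclude the flagged rows along these lines (the source table and the column names are simplified examples here, not necessarily the exact schema; check the repository for the real one):

```sql
-- Hypothetical ETL extraction that skips rows flagged as invalid.
-- Column names on Validation_WrongValue are assumed, not confirmed.
SELECT n.NewbornId, n.LengthCm
FROM dbo.Newborn AS n
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Validation_WrongValue AS wv
    WHERE wv.Id = n.NewbornId
);
```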