SQLTeam.com | Weblogs | Forums

Function to compare arbitrary datasets?

I'm not sure if this can (or should) be built in TSQL. I have built a couple of sales reports which we would want to compare in a few ways. As an example sales volume. Dataset is pretty straightforward. There's a shared key - though there may not be a match across both datasets - and some numerical value.
I know how to do this as a one-off for one dataset. Create a new TVF, call the original datasets, make comparisons & output comparisons. Is there a way to do this with arbitrary datasets? IE I won't know how many columns, what they are called etc. But I would like the average of those datasets.

If not, I know I can do it in Python, but was hoping to keep it all in the same place. Thanks!

Example Dataset #1

Product Sales
Item1 $200.00
Item2 $500.00
Item3 $100.00

Example Dataset #2

Product Sales
Item1 $40.00
Item3 $800.00
Item4 $20.00
Item5 $10.00

so for this sample data you provided (static) what would the final result look like that you want to see without using any TVF just purse SQL

For dynamic sets of data with dynamic sets of columns, you could leverage sys.tables and sys.columns to dynamically process stuff, but you would need some guardrails of sorts, configuration table (min/max tables, max/min columns , data types to compare: int, money or ... etc), so that it does not go off rails.

As the most basic example, just the difference between both tables. Dataset2 minus Dataset1

Product Sales
Item1 $(160.00)
Item2 $(500.00)
Item3 $700.00
Item4 $20.00
Item5 $10.00

Though, I'm heavily oversimplifying here. The truth is that these reports are TVFs built on top of other TVFs. sys.tables/columns do keep the datatypes for the tables that the data came from, but by the time the dataset is complete, it is far removed from its source.

Is there any approach that does not require those guardrails? Can you point me in the right direction?

The guard rails are not required. It's just that without them it will be a Fukushima Daiichi nuclear disaster scenario, it could get stuck in this infinite query that could bring down other systems if not isolated.

Sys.tables and sys.columms are dynamic, depends on how you use them.

Also is this bidirectional ? You did minus could there be other types of calculations? Summation, multiplication division etc

Sounds to me you want some sort of number crunching beast. Maybe sql is not the right tool? Some sort of machine learning or analytics tool might be best suited but again very sparse requirements.

Will this run on some schedule, how often. Or is it fired on demand?
How would you know to compare which columns? What if one is boolean and the other is numeric? How do you prevent from doing a calculation that could error out because of incompatible column data types

TBH, knowing that this is possible is already a good reason not to approach this at all.

It is not bidirectional and there are would be other calculations. Though I thought that providing the actual datasets & actual calculations would distract from the point.

Fair enough. I can do this in Python. My goal was to keep all of the logic of a report on the SQL layer but it is not a requirement. Especially if it is both prohibitively complicated & dangerous to the stability of the system.

I imagined supplying a list of column headers for data that should not be compared. Then using Try_cast for everything (minus the shared key), casting to a decimal & doing the calculations. Then excluding all columns whose sum is 0 from the output as that either means that Try_cast failed or that no data was supplied. That way data would either remain on the dataset but not get calculated, or it would get calculated, or it would attemp a calculation, fail & fall off the resulting dataset altogether.

You'd need a stored proc for that, rather than just a function. You can do anything you would ever need to do in a stored proc.

OK now it is becoming clearer.

"I imagined supplying a list of column headers for data that should not be compared. " <= this is a clearer requirement, we need more of that.

So if it not bidirectional, how do you determine what is on the left and right of the comparison?