MIN aggregate function to a BIT field

That would raise the same kind of non-SARGable hell as having NULL EndDates in a table, IMHO. I suppose there could be a reason to have "unknown" conditions for a bit but I can't think of one requiring such tri-state logic in a properly designed database.

Shifting gears to ask the general question of the public... if anyone has a good idea of when tri-state bits (nullable bits) would be appropriate (as in good design), I'd sure like to hear what it is.

Simple example:

You have a database for an obstetrician. The patient table has a bit for the sex of the baby: 0 = male, 1 = female, NULL = unknown (too early in the pregnancy to tell).

Note that this is extremely common in type 2 SCDs (slowly changing dimensions).

A NULL EndDate wouldn't necessarily make the WHERE condition non-SARGable, unless one made the egregious error of wrapping the column in ISNULL() in the WHERE clause, which should never be done anyway.
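For illustration (table name, column name, and dates are hypothetical), the difference is between wrapping the indexed column in a function and leaving it bare:

```sql
-- Non-SARGable: ISNULL() wraps the indexed column, so the optimizer
-- can't use an index seek and typically falls back to a scan.
SELECT COUNT(*)
FROM dbo.SomeTable
WHERE ISNULL(end_date, '99991231') >= '20240101';

-- SARGable: the bare column plus "OR ... IS NULL" can still be
-- satisfied with index seeks, as the script below demonstrates.
SELECT COUNT(*)
FROM dbo.SomeTable
WHERE (end_date >= '20240101' OR end_date IS NULL);
```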

I can't post an in-line tally table, so I had to use a physical one, but it still demonstrates the "seekability" of "OR IS NULL":

INSERT INTO #t1 ( start_date, end_date, description )
SELECT DATEADD(SECOND, -tally * 5, GETDATE()) AS start_date,
    CASE WHEN tally % 10 = 0 THEN NULL ELSE DATEADD(SECOND, -tally * 5 - 5, GETDATE()) END AS end_date,
    'Tally value was ' + CAST(tally AS varchar(10)) AS description
FROM dbo.tally
WHERE tally BETWEEN 1 AND 1000000

SELECT COUNT(*) FROM #t1

SET STATISTICS IO ON

SELECT COUNT(*)
FROM #t1

--as expected, this is a nice index seek
SELECT COUNT(*)
FROM #t1
WHERE start_date >= DATEADD(MINUTE, -10, GETDATE()) AND
    (end_date <= DATEADD(MINUTE, -9, GETDATE()))

--this is also an index seek
SELECT COUNT(*)
FROM #t1
WHERE start_date >= DATEADD(MINUTE, -10, GETDATE()) AND
    (end_date <= DATEADD(MINUTE, -9, GETDATE()) OR end_date IS NULL)

SET STATISTICS IO OFF

DROP TABLE #t1

I say it should still not be nullable, and it certainly should not be the bit data type. There is such a thing as a hermaphrodite, and the default should be something other than NULL for "Not yet tested".

You're not wrong there. That's precisely one of the uses of NULLs that I try to avoid because of the NULL logic required when you want to know if something will either expire in the future or has no expiration date. Yes, I do agree that the required NULL logic isn't difficult but it's still extra logic.

No problem. What are you using for an index on #T1?

Granted, however it is pretty easy to cook up other simple examples. Schrödinger's cat, e.g.: its condition is unknown (NULL) until we open the box. After that it's definitely alive or dead. (Not aware of other possibilities!)

Sorry for not showing that. It's ( start_date, end_date )
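For anyone reproducing the script above, here's a sketch of the #t1 definition it implies (the data types and lengths are my assumption; the index is the one stated here):

```sql
CREATE TABLE #t1 (
    start_date  datetime    NOT NULL,
    end_date    datetime    NULL,   -- NULL = row still "open"
    description varchar(50) NOT NULL
);

CREATE INDEX IX_t1 ON #t1 (start_date, end_date);
```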

I was going to open up the whole NULL column conundrum but decided that it was a bit (no pun) off topic from the OP. BUT... since the topic has been broached...and without being fanatically religious about it....

No column should be nullable! At the very least, you should have a VERY good reason for allowing NULLs. In almost every situation, you can find an acceptable default value to use in lieu of null (EndDate = Positive infinity, for example). It eliminates those nasty "OR MyColumn is Null" coding constructs that impede performance.
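As a minimal sketch of the sentinel-default approach (table, column, and constraint names are illustrative):

```sql
CREATE TABLE dbo.Membership (
    membership_id int IDENTITY(1,1) PRIMARY KEY,
    start_date    datetime NOT NULL,
    end_date      datetime NOT NULL
        CONSTRAINT DF_Membership_EndDate
        DEFAULT ('99991231')   -- "positive infinity" instead of NULL
);

-- Finding currently-open rows then needs no NULL logic at all:
-- SELECT ... FROM dbo.Membership WHERE end_date > GETDATE();
```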

Can you find an esoteric condition where nullability is the best approach? Sure, granted. But most of the time an attribute is nullable because no one took the time to ask if there was a better way to go. (IMHO)

One reason: If you have a lot of columns where the contents are optional or unknown, making them nullable also lets you use SPARSE and save some space.
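A sketch of what that looks like (names are illustrative); note that SPARSE requires the column to be nullable, which is the point being made:

```sql
CREATE TABLE dbo.PatientExtra (
    patient_id    int NOT NULL PRIMARY KEY,
    blood_type    char(3)      SPARSE NULL,  -- rarely populated
    allergy_notes varchar(200) SPARSE NULL   -- optional/unknown
);
```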

Personally, I don't like a default value for a column where the values are truly unknown and may never be.

Wouldn't it be better to include a properly designed sister table to handle the eventuality of non-NULL values? Even if you do have a good reason to store NULL values, if the column is a fixed-length data type, it's going to occupy that data type's space on the row whether you have a NULL or not. Forgetting about backups and the space it may occupy in cache, it also means that the number of rows per page is reduced, and that has an effect on the performance of queries that may brush against the clustered index.

Yes, I realize that it's convenient to not have to do additional joins, but that may justify the use of properly built views to present the data as a whole.
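A sketch of the sister-table idea (all names are illustrative): the base table has no nullable column, a row appears in the sister table only once the value is known, and a view presents the data as a whole.

```sql
-- Base table: no nullable column
CREATE TABLE dbo.Contract (
    contract_id int PRIMARY KEY,
    start_date  datetime NOT NULL
);

-- Sister table: a row exists only once the end date is known
CREATE TABLE dbo.ContractEnd (
    contract_id int PRIMARY KEY
        REFERENCES dbo.Contract (contract_id),
    end_date    datetime NOT NULL
);
GO
-- View presents the data as a whole
CREATE VIEW dbo.ContractView AS
SELECT c.contract_id, c.start_date, e.end_date
FROM dbo.Contract c
LEFT JOIN dbo.ContractEnd e ON e.contract_id = c.contract_id;
```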

Even if SPARSE? I thought that was the reason for SPARSE??

Absolutely correct. But, like all such tricks, it comes at a cost. Here's a quote from BOL.

Sparse columns are ordinary columns that have an optimized storage for null values. Sparse columns reduce the space requirements for null values at the cost of more overhead to retrieve nonnull values. Consider using sparse columns when the space saved is at least 20 percent to 40 percent.

Admittedly, I've not done a test on how much more overhead that would entail or whether it would be cheaper than a sister table, etc. I do try to design tables where a column won't suffer from NULLs never mind getting to the point of where I'd have to use something like SPARSE.

I'm also a bit concerned about such code. How long will it actually be available? Heh... don't laugh at that fear. Everyone at my old shop was raving about VARDECIMAL when it first came out. How long did that last before they deprecated it?

I believe compressed rows also do not store fixed-length columns if they are NULL, if at all possible.

It's easy enough to envision valid uses for NULL values. Tournament pairings/results, for example. For NCAA basketball, the locations and datetimes of the later rounds are known at the start -- and presumably/potentially in a table -- but the teams that will be playing in those later rounds are literally unknown until the earlier rounds are completed.

Wouldn't it be more properly normalized to have a separate table that matches the teams to the rounds instead of having the teams in the same table as the rounds?

I think that's what Scott meant. A table with rounds and foreign keys to the teams. However, the pairings are unknown until later.

Yes, it would be a separate table, but if you don't pre-populate all entries, then you have to combine results from more than one table when showing both completed and uncompleted rounds. When I've done this for bridge matches, I admit I've just de-normalized it all into one table with the results NULL until known later, when you just UPDATE the results rather than having to INSERT later.
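A sketch of that denormalized design (names are illustrative): venue and schedule are known up front, the team and result columns stay NULL and get UPDATEd as the bracket resolves.

```sql
CREATE TABLE dbo.Pairing (
    round_no     tinyint     NOT NULL,
    slot_no      tinyint     NOT NULL,
    venue        varchar(50) NOT NULL,   -- known at the start
    scheduled_at datetime    NOT NULL,   -- known at the start
    team1_id     int NULL,  -- NULL until earlier rounds finish
    team2_id     int NULL,
    winner_id    int NULL,  -- NULL until the match is played
    PRIMARY KEY (round_no, slot_no)
);

-- Later rounds are filled in by UPDATE rather than INSERT:
-- UPDATE dbo.Pairing SET team1_id = @winner
-- WHERE round_no = 2 AND slot_no = 1;
```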

I don't mind designing tables where a column exploits the advantages of NULL values, though. VARDECIMAL? Yeah, that was a weird one. At the moment though, the best place to look for deprecated features is Azure SQL database since it shares the same code base as on-prem SQL Server and is quite a bit ahead in the dev cycle. Basically, anything that will be deprecated will likely happen in Azure first. Will SPARSE go away? Who knows? It's certainly not something I'm going to worry about though.

Yeah, I can't see not using features just because they might go away. Sure, avoid things MS has already explicitly deprecated if at all possible, but I wouldn't want to forgo the best solution available at the time just because it could possibly be deprecated.