Where clause of an update statement

Greetings!

I have a large table (45M rows) in which I need to combine a date field and a time field into a single DATETIME field. The SQL statement is sound in that it works on smaller sets, but larger sets seem to be giving me issues.

When I use larger sets to update (60+ days) I get a message that a string-to-datetime conversion has failed. If you look at the query, I have removed any chance of invalid dates or times being included in the update statement. Furthermore, I can lower the record count (shorten the number of days in the where clause) and it runs without issue, and I can repeat that across the exact same date range as the original query and it works fine (albeit taking multiple queries to cover the range).

I have seen similar behavior in the past on other update statements and never really looked too deeply into it. It appears that when a large number of rows are targeted, the database evaluates records that are not within the range of the where clause before it executes, causing a failure.

Is this possible? How do I limit the update statement and make it stay within the bounds of the where clause?

Any help in my understanding of the issue will be greatly appreciated!

Sample below (NOTE: I cut it short on the in clause, no need to clog up space here just to list off all the possible valid times. The statement ends with ...,'2359')

update tracing
set locdatetime = convert(varchar, locationdate, 101) + ' ' + substring(locationtime, 1, 2) + ':' + substring(locationtime, 3,2)
where locationdate >= '10/1/2013'
and locationdate < '11/15/2013'
and locdatetime is null
and locationtime in('0000',
'0001',
'0002',

I suspect you have "dirty" data (odd values that don't comply) in your locationdate or locationtime field. That is what caused the conversion failure. Try to find those rows with your where clause, then either fix them or filter them out of the update.
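Something along these lines might flush them out (a sketch using your column names, untested):

select distinct locationtime
from tracing
where locationtime is null
   or len(locationtime) <> 4
   or locationtime like '%[^0-9]%'
   or left(locationtime, 2) not between '00' and '23'
   or right(locationtime, 2) not between '00' and '59'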

That is why I limited it to specific dates and to only the times listed, which are the valid times. There should be no possible way for invalid dates or times to get through the filter (where clause). If the where clause is applied during the update, how could invalid data be included in the conversion?

'10/2/2013Bob'sYourUncle' passes your filter but is an invalid date, I think.

There are no invalid dates, the date field is a DATETIME so it would/could not store anything beyond the range of valid SQL dates. The time field is a varchar(4), so I limited it to be just the legitimate possible text values.

Also, if I try to query for DATE >= '1/1/2014' and DATE < '1/1/2015' it fails, but if I go through one-month intervals (DATE >= '1/1/2014' and DATE < '2/1/2014', then the next month, and so on through 1/1/2015) it works fine. Why would reducing the range from 12 months to one month make a difference?

The other issue is that I have run the queries (hundreds of them) in one-month intervals and completed the update for all 45M rows without modifying any other data. The remaining 28k rows do not have dates; they are null.

What is odd is that I can limit the set down to smaller subsets and achieve the desired results (by running hundreds of queries) instead of accomplishing the task with one query or a few.

In the end, ALL records have been updated. Why is a reduced set required to get SQL to apply the update only to the rows matching the where clause and not evaluate records outside of that restriction?

It is almost as though SQL is looking at NULL + Substring(TIME, 1, 2) + ':' ... and failing, when clearly the NULL dates are excluded by the where clause (as are any invalid times).

Pity you didn't say the column has a datetime type in the first place. Would've saved some useless replies. Also, when using literals for datetime comparisons, use the 'yyyymmdd' format, since it is locale-independent. Thus

DATE >= '20140101' and DATE < '20150101'

is better practice
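To see why that matters (illustrative; run it in a scratch window):

set language british
select cast('10/1/2013' as datetime)  -- dmy login: 10 January 2013, not October 1
-- select cast('11/15/2013' as datetime)  -- fails under dmy: there is no month 15
set language us_english
select cast('10/1/2013' as datetime)  -- mdy login: October 1 2013
select cast('20131001' as datetime)   -- unambiguous under either setting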

Please confirm that on your machine this fails:

declare @ table ([date] datetime)
insert into @ values('20140101')

select 'I passed!'  as 'The Test'
from @
where [date] >= '1/1/2014' and date < '1/1/2015'

And post the error message(s).

Why is a reduced set required to get SQL to only apply the update to the where clause filter and not evaluate records outside of this restriction?

Because there is either something wrong with your data, your query or both.

Since locationtime is varchar(4), there could be rows where the data is not a valid time. Should be easy to find.

For that matter, why varchar? It costs two extra bytes for the length of the data, but the data is always four characters (HHMM), correct?

Note that this works fine (returns null):

select null + substring ('hh:mm', 1,2)

Other problem: don't use DATE and TIME as column names, since those are also SQL keywords.

declare @ table ([date] datetime)
insert into @ values('20140101')

select 'I passed!' as 'The Test'
from @
where [date] >= '1/1/2014' and date < '1/1/2015'

Works just fine, output = I passed!

Field names are not DATE and TIME, they are locationdate and locationtime, I redacted them in further comments to save time.

Yes, I should have mentioned in the first place it was a date field, that would have helped.

I don't see how I could have an invalid time in the mix as the time is presented to us in a file as varchar(4) representing HHMM. The IN() statement of the where clause contains the 1,440 valid HHMM combinations represented on a 24 hour clock. Therefore, there cannot be invalid times being presented.

I expected there to be 28k rows that ended with a NULL value in the LOCATIONDATETIME field and that is confirmed to be true after the fact. The question still remains, why did I have to break it into multiple smaller queries instead of a single query? It is as though SQL is not applying the where clause and including some or all of the LOCATIONDATE = NULL records into the evaluation.

The error message I received was "Conversion failed when converting date and/or time from character string."

BTW, Thanks for the quick replies. I really need to understand why this is behaving this way. Not for a production fix, just for my edification. It makes no sense to me why I had to do this in this manner. This is not my first rodeo here with SQL, been at this for 20 years. This one has me stumped though.

That's interesting. I suppose I could think of a more verbose way to code that filter, but it would take me some time! Why not simply:

WHERE len(locationtime) = 4 and locationtime not like '%[^0-9]%' and left(locationtime, 2) between '00' and '23' and right(locationtime, 2) between '00' and '59'

Tell me, does SQL throw the error right away, or after processing for some time? If the former, it is a quirk, I think. SQL sees a potential conversion problem and refuses to execute the query. One way around it is to use a CASE statement in the SELECT clause so SQL can see that you are handling any potential exceptions.
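In the SET expression, something along these lines (a rough sketch; the LIKE pattern is deliberately loose, tighten it to taste):

set locdatetime = case
    when locationdate is not null and locationtime like '[0-2][0-9][0-5][0-9]'
    then convert(varchar(10), locationdate, 101) + ' ' + substring(locationtime, 1, 2) + ':' + substring(locationtime, 3, 2)
    else null
end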

I tried using the range method with the same results. I went total kindergartener on this and actually listed the entire 1,440 combinations to make sure nothing slipped in.

It processes for about 5 minutes in SSMS then tosses that error. If I just go reduce the date range down to a smaller set and hit execute it works just fine. I don't change anything else on the query but the ending date.

This one is odd, I can't figure out what it is doing. It almost seems like it is doing a pass of the table first to do the concatenation without regard to the where clause when the set is above a certain threshold.

Is parallelism indicated in the plan?

FWIW I'd still put a CASE statement in the SELECT to see if your results change.

Another way:

set locdatetime = dateadd(hour, cast(left(locationtime, 2) as int), dateadd(minute, cast(right(locationtime, 2) as int), locationdate))
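Or, if this server happens to be on SQL Server 2012 or later (an assumption on my part), TRY_CONVERT makes the whole expression come back NULL instead of throwing:

set locdatetime = try_convert(datetime,
    convert(varchar(10), locationdate, 101) + ' ' +
    substring(locationtime, 1, 2) + ':' + substring(locationtime, 3, 2))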

I wish I could upload a picture of the plan, alas, I am but a rookie here.

There is parallelism in the plan, I'll do my best to represent it via ASCII lol...

Update (Cost 0%)
 <- Clustered Index Update (6%)
 <- Top (0%)
 <- Parallelism (Gather Streams, 0%)
 <- Hash Match (Right Semi Join, 0%), which splits into two branches:
    [top branch] <- Parallelism (Distribute Streams, 0%) <- Constant Scan (0%)
    [lower branch] <- Table Spool (Eager Spool, 0%) <- Parallelism (Repartition Streams, 1%) <- Compute Scalar (0%) <- Clustered Index Scan (RDB%PRIMARY2, 93%)

Not sure how readable that "report" will be. The short answer is yes, there is parallelism being applied here.

I will give a case statement a shot and see if it affects the outcome. I should know here in a few minutes. Curiosity is getting the better of me now.

Let's not take any chances at all:

update t
 set locdatetime = convert(varchar(10), t.locationdate, 101) + ' ' + 
     substring(ca1.locationtime_force4, 1, 2) + ':' + substring(ca1.locationtime_force4,3,2)
from tracing t
cross apply (
    select right('0000' + CAST(t.locationtime AS varchar(4)), 4) as locationtime_force4
) as ca1
where t.locationdate >= '20131001'
 and t.locationdate < '20131115'
 and t.locdatetime is null
 and 1 = case when t.locationtime like '%[^0-9]%' then 0 when t.locationtime between 0000 and 2359 then 1 else 0 end

Too late!

set locdatetime = CASE LOCATIONDATE
WHEN NULL THEN NULL
ELSE convert(varchar, locationdate, 101) + ' ' + substring(locationtime, 1, 2) + ':' + substring(locationtime, 3,2)
END

Worked like a champ. That means that SQL is looking outside of the WHERE DATE filter and reviewing records for some reason. How would that be possible? Why would the case statement trump the where clause?

I'll try your model too and see what happens. This is getting awfully strange.
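A side note on that fix, for anyone reading along: the simple CASE form compares with equality, and LOCATIONDATE = NULL is never true, so the ELSE branch actually runs for every row. It evidently still changed the plan enough to dodge the error here, but the searched form is what really tests for NULL:

set locdatetime = case
    when locationdate is null then null
    else convert(varchar, locationdate, 101) + ' ' + substring(locationtime, 1, 2) + ':' + substring(locationtime, 3, 2)
end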

Unfortunately 1299 is between 0000 and 2359

The case statement won't be executed until projection.
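In other words, the optimizer is free to compute the new value before the filter has removed every row. A minimal repro of the same class of problem, on made-up data:

declare @t table (s varchar(10))
insert into @t values ('1'), ('2'), ('oops')

-- This can throw "Conversion failed..." even though the LIKE predicate
-- "should" exclude 'oops': SQL Server does not guarantee that predicates
-- are evaluated in the order written, or before computed expressions.
select cast(s as int)
from @t
where s not like '%[^0-9]%'
  and cast(s as int) > 0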

Why would SQL not filter down the recordset first and then apply the concatenation? It seems odd that when it hits some threshold it goes into a table scan without regard to the evaluation of the where clause.

If I were going to do a table scan, I would first limit the records to be scanned by applying the filter first. This would have to be somewhat faster if not just more organized. The other question I would have is that if this is true, what is the threshold for making it a scan first and can I prevent this with a switch or flag rather than complicating an update statement with a CASE statement?

In reality, the case statement is reiterating what the where clause is doing. In certain instances this type of case logic may not be so straight forward.

Would love to have a handle on this as this seems to be a potential issue for me in the future.

BTW - Your modified query above resulted in:
Msg 242, Level 16, State 3, Line 1
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.

That's significant. There is probably a bad locationtime in there, maybe something like hhmm where mm > 59

and 1 = case when t.locationtime like '%[^0-9]%' then 0 when t.locationtime / 100 between 00 and 23 and t.locationtime % 100 between 00 and 59 then 1 else 0 end
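A quick way to sanity-check that predicate against edge cases (a sketch on made-up values):

select v.locationtime,
       case when v.locationtime like '%[^0-9]%' then 0
            when v.locationtime / 100 between 0 and 23
                 and v.locationtime % 100 between 0 and 59 then 1
            else 0 end as passes
from (values ('0000'), ('1299'), ('2359'), ('2400'), ('12a4')) as v(locationtime)

Note that '1299' now fails the minute check. As for a switch or flag: I don't know of one that forces the filter to run first; OPTION (MAXDOP 1) can change the plan shape, but guarding the expression (the CASE or TRY_CONVERT approaches above) is the dependable route.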