In-place update to cluster index

As I understand it, SQL Server does those "did the non-clus key data really change?" checks internally: it does not write a log record and does not mark the data page as "dirty" if the data did not change. SQL itself "understands" how important it is to avoid I/O.

But a change to the clus key does indeed always cause a delete and an insert.

Either way, I personally would never try to code those checks on a WHERE. Too many changes for serious error vs. the potentially very mild gain: 1) the clus key should change only very rarely and 2) if the clus key got UPDATEd but didn't change, it will obviously stay in the same data page anyway, which must already be in the buffer in order to read that row in the first place. That said, I wouldn't object to a check in the code and generating different UPDATE code to be run dynamically to avoid re-setting the clus key to the same value, if necessary, but I wouldn't use it in a WHERE to negate the entire UPDATE.
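If someone did want to go that route, the shape of it would be something like this sketch (hypothetical table and column names, purely to illustrate building the SET list dynamically so the clus key is only assigned when it really changes):

-- Sketch only: dbo.Orders / OrderID / StatusCode are made-up names, not anyone's real schema
DECLARE @OldOrderID int         = 1,
        @NewOrderID int         = 1,
        @Status     varchar(10) = 'SHIPPED';

DECLARE @sql nvarchar(max) =
      N'UPDATE dbo.Orders SET StatusCode = @Status'
    + CASE WHEN @NewOrderID <> @OldOrderID        -- is the clus key actually changing?
           THEN N', OrderID = @NewOrderID'        -- only then include it in the SET list
           ELSE N'' END
    + N' WHERE OrderID = @OldOrderID;';

EXEC sys.sp_executesql @sql,
     N'@Status varchar(10), @NewOrderID int, @OldOrderID int',
     @Status, @NewOrderID, @OldOrderID;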

Additional details on the "non-UPDATE" UPDATE:
http://sqlblog.com/blogs/paul_white/archive/2010/08/11/the_2D00_impact_2D00_of_2D00_update_2D00_statements_2D00_that_2D00_don_2D00_t_2D00_change_2D00_data.aspx

such as?

FWIW I do this with all updates. I make my updates specifically target the rows I want to change and nothing else. It's far too easy to write:

update foo
set bar = null

and blow away an entire column. So little effort to add

where col_meets_my_conditions

Typo, should have said:
Too many chances for serious error.

Of course you have normal WHERE conditions. What you don't try to do is verify yourself that a column's value actually changed before you allow the UPDATE. That would be just a nightmare to code correctly, and all for no real gain. I just can't imagine why anyone would want to even attempt that.

I've always found that pretty simple to code correctly the first time, every time (copy/paste from the SET clause). As for gain? Avoiding the possibility that SQL might mark the page dirty when it's not.

Again, SQL does not mark the page as dirty for non-clus-key column non-changes.

OK, so gen me the WHERE clause real quick that checks col_int, col_datetime and col_varchar to make sure an UPDATE is actually needed for the row.

CREATE TABLE table1 ( id int IDENTITY(1, 1) NOT NULL, col_int int NULL, col_datetime datetime NULL, col_varchar varchar(40) NOT NULL )

where new.col_int <> old.col_int or new.col_int is null and old.col_int is not null or new.col_int is not null and old.col_int is null

... etc.

Really simple copy/paste (especially using Alt+mouse select in SSMS).

Anyway, I'd like to believe you about not marking pages dirty. Do you have a good ref, or is it something you've observed in the wild?

If we have a monster WHERE clause to check lots of columns (e.g. where we receive a fresh data feed every day, but it always includes "all rows" because the sender is useless and has no ChangedOn date/time column), then we mechanically generate the WHERE clause to compare all relevant columns. That includes checking for a NULL on one side but not the other, and for a case-sensitive change (which we will physically make) on a column that is collated case-insensitive, because even if case doesn't matter for user searches, if we print the guy's name out we certainly want to spell it the way he just told our call centre over the phone that he likes it!

I don't know if the CPU for the comparison outweighs the I/O, but we then set a ChangedOn date/time of our own, and we definitely only want to do that for rows that have actually changed :slightly_smiling:

We do find, often enough that it has some value, that a client has a question about "when" some data changed, and our ChangedOn date provides them with an answer that they cannot get from the source owner of the original data ...

When we move data from Server-A to Server-B (e.g. an overnight freshen-up) we keep an "as-was" copy of the database on the source server (i.e. as "local" to the original data as possible). We freshen up that copy with a mega-WHERE clause so only rows that have changed materially [and only in columns that WE are interested in] get a fresh ChangedOn date/time, and then from the remote server we "pull" only rows newer than MAX(ChangedOn) in the remote's table. The I/O across the WAN is of course a very different-sized pipe than the I/O to the local disks :slightly_smiling: so we think, in this instance at least, that the hefty investment in Mega WHERE clauses is worth it.
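In rough terms the nightly pull then ends up shaped like this (a sketch only -- the linked server, table and column names here are placeholders, not our real ones):

-- Run from the destination (Server-B): pull only rows newer than what we already hold
DECLARE @LastChangedOn datetime;

SELECT @LastChangedOn = MAX(ChangedOn)
FROM dbo.MyDestinationTable;

-- Land the changed rows in a staging table, then merge/update locally from there
INSERT INTO dbo.Staging_MyTable (SomeID, StringCol, NonStringCol, ChangedOn)
SELECT S.SomeID, S.StringCol, S.NonStringCol, S.ChangedOn
FROM SourceServer.SourceDB.dbo.MySourceTable_AsWas AS S
WHERE S.ChangedOn > ISNULL(@LastChangedOn, '19000101');   -- handle an empty destination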

But having written the code to mechanically generate a Mega Where Clause we then tend to use that wherever we think it could benefit - i.e. even if freshening up rows from one table to another, within the same server + database.

Sorry, I realise I'm widening the discussion, if folk mind I'm happy to move to a new thread.

Love that "etc."! You have to AND this to all the other column checks and to the primary WHERE conditions. I suspect you'll get parentheses overload if nothing else. Again, that's a sheer nightmare to maintain as the code changes ... drop a column from the SET and you have to modify the WHERE clause, ugh!

No nightmare at all. Easy-peasy. And you only wrap the entire (where something changed) check in one set of parens. Everything is OR'ed inside.

But I still want to know of a good reference showing that SQL won't mark the page dirty. Even though it's easy, I'd love to stop it!

Yeah, it's so "easy peasy" you can't bother to do it for just 3 columns. We have tables with dozens of columns. Again, it's just bizarre to me that anyone would code that, even if SQL marked the page as dirty, which it does not.

I only care if there is a) a performance benefit and/or b) a LOG file / etc. reduction, but if there is, the DEV effort is worth it to me.

As I said before, we mechanically generate the WHERE clause so it's no bother. If we change the schema we would revisit the code anyway, in case it is affected in some other way, so regenerating the WHERE clause is neither significant time-wise nor an additional task, given that we will be reviewing the code anyway. Schema changes are not an everyday occurrence for us :slightly_smiling:

I don't think this is "safe". It won't find an upper/lower (or accent) change on columns that have case/accent insensitive collation, nor a trailing space on one but not the other. It is, however, what we use for all non-string columns.

I would welcome feedback on whether this code that we use is "safe":

FROM MySourceTable AS S
	JOIN MyDestinationTable AS D
		ON D.SomeID = S.SomeID
WHERE
(
	   -- String column: DATALENGTH catches trailing-space differences, the binary
	   -- collation catches case/accent-only differences, and the IS NULL pairs
	   -- catch a NULL on one side only
	   (DATALENGTH(D.[StringCol]) <> DATALENGTH(S.[StringCol]) 
		OR D.[StringCol] COLLATE Latin1_General_BIN2 <> S.[StringCol] 
		OR (D.[StringCol] IS NULL AND S.[StringCol] IS NOT NULL) 
		OR (D.[StringCol] IS NOT NULL AND S.[StringCol] IS NULL))
	-- Non-string column: plain inequality plus the same explicit NULL checks
	OR (D.[NonStringCol] <> S.[NonStringCol] 
		OR (D.[NonStringCol] IS NULL AND S.[NonStringCol] IS NOT NULL) 
		OR (D.[NonStringCol] IS NOT NULL AND S.[NonStringCol] IS NULL))
	OR ...
)

(Legacy TEXT columns are a problem too as they also need a CAST to varchar(MAX))
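For those we end up generating an extra chunk like this (sketch, same S/D aliases as above; [LegacyTextCol] is a placeholder name):

	-- Legacy TEXT columns can't be compared directly, so CAST both sides first
	OR (CAST(D.[LegacyTextCol] AS varchar(MAX)) COLLATE Latin1_General_BIN2
			<> CAST(S.[LegacyTextCol] AS varchar(MAX))
		OR (D.[LegacyTextCol] IS NULL AND S.[LegacyTextCol] IS NOT NULL)
		OR (D.[LegacyTextCol] IS NOT NULL AND S.[LegacyTextCol] IS NULL))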

How do I performance test this?

Let's say I copy a large table to a new table - i.e. both are identical at that point. Then I UPDATE all rows a) with no WHERE clause and b) with a "mega WHERE clause"

  1. I can check any impact on the LOG file
  2. I can Time the duration
  3. How would I check if there is any change to Pages? Indexes?
  4. Can I use a CHECKSUM? (on the grounds that some unnecessary updates would be OK, but presumably CHECKSUM cannot guarantee to find 100% of ALL differences?)
  5. Is there anything else that it would be interesting to check?
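For points 1 and 3, something along these lines should do before/after each test UPDATE (a sketch only; fn_dblog and the buffer DMV are undocumented, so treat the numbers as indicative):

-- 1. Log impact: record count, plus log size / % used per database
SELECT COUNT(*) AS log_records FROM fn_dblog(NULL, NULL);
DBCC SQLPERF(LOGSPACE);

-- 3. Dirty (modified-in-memory) pages currently in the buffer pool for this database
SELECT COUNT(*) AS dirty_pages
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID()
  AND is_modified = 1;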

Perhaps I should be using INTERSECT ?
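If so, the pattern I have in mind is something like this (same S/D aliases as above; NOT EXISTS over INTERSECT treats NULLs as equal, so all the separate IS NULL checks disappear -- though it still compares using each column's own collation, so it won't see case-only changes on a case-insensitive column):

FROM MySourceTable AS S
	JOIN MyDestinationTable AS D
		ON D.SomeID = S.SomeID
WHERE NOT EXISTS
(
	SELECT S.[StringCol], S.[NonStringCol]   -- the two rows "intersect" only when every
	INTERSECT                                -- column pair is identical (NULL matches NULL)
	SELECT D.[StringCol], D.[NonStringCol]
)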

OK... just for you (the WHERE clause took less than a minute, and 100 columns would be no different; I'm not a fast typist (quite the opposite, in fact!) but SSMS gives great tools (Alt-select) to make this really easy and fast):

UPDATE T1
set t1.col_int      = new.col_int,
    t1.col_datetime = new.col_datetime,
    t1.col_varchar  = new.col_varchar

from table1 t1
cross join
(select 1, cast('20160217' as datetime), 'one, today') new(col_int, col_datetime, col_varchar)
where 
(
   t1.col_int      <> new.col_int       or t1.col_int      is not null and  new.col_int      is null  or t1.col_int      is null and  new.col_int      is not null
or t1.col_datetime <> new.col_datetime  or t1.col_datetime is not null and  new.col_datetime is null  or t1.col_datetime is null and  new.col_datetime is not null
or t1.col_varchar  <> new.col_varchar   or t1.col_varchar  is not null and  new.col_varchar  is null  or t1.col_varchar  is null and  new.col_varchar  is not null

)

Seriously though, I'd love to see a reference that states the page is not marked dirty if I replace some column with exactly the same data.

FWIW I just did a little test. I created a test db, created Scott P's test table, then ran my update statement several times, with and without the where clause. What I found consistently is this:

After the first couple of updates:

Without the where clause, the number of log rows increases by two each time I run the update.
With the where clause (even if painful to construct!) the number of log rows does not increase between updates.
This makes me suspect the claim about dirty pages.

I'd love someone to confirm my findings. Here's what I did:

CREATE DATABASE DirtyPagesDB
GO
USE DirtyPagesDB
GO

/* --

ALTER DATABASE DirtyPagesDB
SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;
DROP DATABASE DirtyPagesDB

-- */
-- Disable automatic checkpoints so that data isn't flushed mid-test
-- (trace flag 3505 disables automatic checkpoints; left commented out here)
--DBCC TRACEON(3505, -1);
--DBCC TRACESTATUS();
GO
CREATE TABLE table1 ( id int IDENTITY(1, 1) NOT NULL, col_int int NULL, col_datetime datetime NULL, col_varchar varchar(40) NOT NULL )

insert into table1(col_int, col_datetime, col_varchar) 
select 1, cast('20160217' as datetime), 'one, today'

CHECKPOINT

--number of log rows so far:
select COUNT(*) from fn_dblog(null,null)

UPDATE T1
set t1.col_int      = new.col_int,
    t1.col_datetime = new.col_datetime,
    t1.col_varchar  = new.col_varchar

from table1 t1
cross join
(select 1, cast('20160217' as datetime), 'one, today') new(col_int, col_datetime, col_varchar)

-- Add the change-checking where clause
where 
(
   t1.col_int      <> new.col_int       or t1.col_int      is not null and  new.col_int      is null  or t1.col_int      is null and  new.col_int      is not null
or t1.col_datetime <> new.col_datetime  or t1.col_datetime is not null and  new.col_datetime is null  or t1.col_datetime is null and  new.col_datetime is not null
or t1.col_varchar  <> new.col_varchar   or t1.col_varchar  is not null and  new.col_varchar  is null  or t1.col_varchar  is null and  new.col_varchar  is not null
)

-- number of log rows now:
select COUNT(*) from fn_dblog(null,null)

Hi gbritton.

I'd like to stress that what I'm doing is an update to a table with 40+ million rows and 180+ columns, and I don't do it in one UPDATE. I update ~10,000 rows at a time (in a WHILE loop), and the database recovery model is set to SIMPLE.
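The loop itself is nothing fancy -- roughly this shape (placeholder table/column names; the NeedsUpdate flag is just one way of tracking progress, not necessarily exactly what I do):

DECLARE @BatchSize int = 10000,
        @Rows      int = 1;

WHILE @Rows > 0
BEGIN
    -- Update one batch, clearing the flag so the same rows aren't picked up again
    UPDATE TOP (@BatchSize) T
    SET    T.SomeCol     = S.SomeCol,
           T.NeedsUpdate = 0
    FROM   dbo.BigTable AS T
           JOIN dbo.StagingTable AS S
             ON S.Id = T.Id
    WHERE  T.NeedsUpdate = 1;

    SET @Rows = @@ROWCOUNT;
END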

Checking whether none of the 180 columns has changed would be a REAL NIGHTMARE. I still think (for the time being at least) that doing a straight update will be faster than an update with all the checks in place.

I'll try to check this assumption but this is a rather big task. I'll do it when I have a bit of time...

Best,
Darek

If you post me the column names (original and changed) I can build you the where clause in a minute or two.

Yes, the two log rows added are for start of transaction and end of transaction. Try SELECT * instead of SELECT COUNT(*) from the log.
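Something like this makes the pair obvious (column names as they come back from fn_dblog; a non-updating UPDATE should add only LOP_BEGIN_XACT / LOP_COMMIT_XACT, with no page modifications in between):

SELECT [Current LSN], Operation, Context, AllocUnitName
FROM fn_dblog(NULL, NULL)
ORDER BY [Current LSN];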

If you don't trust the link I posted earlier from Paul White -- who's well respected in the SQL Server community -- keep doing it your way. I hope you don't falsely prevent any UPDATEs from occurring to your tables, but I'm fairly certain you will. Given that volume of extra code, some type of bug is bound to work its way in.

Very interesting, thanks (and not what I would have expected).

Some caveats though (just paraphrasing here for the sake of discussion in this thread):

Physical updates will still occur if:

Clustered Index columns are included in the SET statement (even if the data they contain does not change). It's easy enough to exclude them from the SET, and cases where the data in CI columns actually changes are rare, so it's unlikely they'd be in the SET anyway; that one is probably "just fine".

A physical update will also occur for a varchar / nvarchar column holding more than 8,000 / 4,000 characters. I have plenty of columns defined as VARCHAR(MAX) etc. but very few ROWS where the data exceeds 8,000 chars, so that would be fine for me too (99% of the time).

If ALLOW_SNAPSHOT_ISOLATION or READ_COMMITTED_SNAPSHOT is ON then UPDATE will always make a physical update. That would be a deal-breaker for me, but I wonder if I could turn READ_COMMITTED_SNAPSHOT off for the duration of the update; I don't suppose I need it to be turned on in the middle of the night on a database only used 9am-5pm :slightly_smiling:

I'll make sure that my tests aren't made on a database with READ_COMMITTED_SNAPSHOT turned on.
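Easy enough to check up front -- both settings are visible in sys.databases:

SELECT name,
       is_read_committed_snapshot_on,   -- READ_COMMITTED_SNAPSHOT
       snapshot_isolation_state_desc    -- ALLOW_SNAPSHOT_ISOLATION
FROM sys.databases
WHERE name = N'DirtyPagesDB';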

EDIT: Any trigger will still fire, so that would be a situation where there could be unwanted "side effects".

One thing I'd be interested to look at. Presumably an UPDATE on a column with case-insensitive collation would physically happen if the only difference was the Case of a character. However, in a (simple) WHERE clause of Source.MyColumn <> Destination.MyColumn that would not see the Character Case as being different so NO update would happen. (Personally I think that would be a bug in the APP code, but it might be seen as a desirable side-effect in some circumstances).

Hmm, I wouldn't recommend turning snapshot off and on if it's expected to be on.

But yet another reason snapshot isolation should be used only if really needed. I've heard people recommend it should be turned on as a default (hey, if I wanted to run like Oracle, I'd have bought Oracle! :grin:). It's a wonderful option to have available, but be aware of the large overhead involved.