In-place update to cluster index

darlove · February 11, 2016, 12:17pm

Hi. Can anyone please answer this short question? I have a clustered index made of three columns (int, int, datetime). Now, I do an update to the datetime column but the new value is the same as the old one. Will this be an in-place update without any movement or will it be the delete-insert kind of update? Thanks a million.

gbritton · February 11, 2016, 1:49pm

in-place, but don't do it.

write your updates like:

update foo
set foo.col = newval
where foo.col <> newval

darlove · February 11, 2016, 3:01pm

Hi hbritton.

I have some bad news for you The update will not be in-place as the answer to this demonstrates.

By the way, why do you say not to do it?

Best,

Darek

Kristen · February 11, 2016, 3:21pm

Generally I would regard it as "not good" if the Clustered Index is also the Primary Key. We aim never to modify Primary Key columns, and if the PKey is part of a Foreign Key then modifying it at all is a Big Job ...

If the Clustered Index is NOT the PKey, and the key columns have been chosen because they are the most commonly accessed, then I see not reason not to update the Date column if it needs to be changed. We always have a WHERE clause on our UPDATEs so that we never update a row where the data has not actually changed (as in GBritton's example)

If the INT, INT leading columns in the Clustered Index are themselves unique, or unique-for-some-rows, then perhaps? those changes will be in-place.

darlove · February 11, 2016, 3:27pm

Hi Kirsten,

I know the principles of good cluster index design but in this case I don't have a choice because the table is partitioned on the date key and even though the integer columns are, indeed, unique, the date must be included in the index. The date can be modified but it's rare and hence the design is not as bad as one might think.

Question: Why wouldn't you update a column with the same piece of data? Where does it hurt?

To answer your last question---the change will not be in-place. Follow the link in my answer to gbritton's answer and you'll see a material example of this.

Thanks.

Best,

Darek

Kristen · February 11, 2016, 3:34pm

What's the point? SQL will make the update anyway, but change nothing. Wasted CPU, I/O, Log space, backup file size, row-blocking, etc. Plus any trigger will fire.

I see lots of

UPDATE MyTable
SET MyColumn = @NewValue
WHERE OtherColumn = @CriteriaValue

and it always suprises me that the DEV didn't think, or was never taught, to add:

AND MyColumn <> @NewValue OR (MyColumn IS NULL AND @NewValue IS NULL)

Of course if MyColumn will never be the same as @NewValue then there is no need, as in:

UPDATE MyTable
SET MyColumn = @NewValue
WHERE MyColumn = @OldValue

assuming that the user is not allowed to choose @NewValue the same as @OldValue when making the update!!

The reason why I am surprised is that the reason I'm looking at the code is usually that it is performing badly - BLOAT from "Update ThisColumn to NoChangedValue" is huge IME and its a future hidden performance-killer.

Kristen · February 11, 2016, 3:34pm

Personally I don't care (but maybe you have a reason to?). If I need to make the change then ... I need to make the change

darlove · February 11, 2016, 3:39pm

Well,

I don't necessarily agree with this. If you do a very focused update in an OLTP system, it might be true what you say. However, if you do an update to millions of rows (in a data warehouse, for instance), the overhead of checking if a value is different can be much higher than doing the actual update, especially if the update affects many columns, not just one. And by many I mean something that ranges from a couple of tens to hundreds...

Best,

Darek

Kristen · February 11, 2016, 4:01pm

I don't do much data warehouse type stuff (we do freshen up data from, say, OLTP to an Enquiry database, and we do pull data from another application's database into our own local-copym but that's about it). Apart from populating a database how common it is to do a many-row, many-column, update that does not have a WHERE clause of some significance? Of course there is no sense adding code to avoid an update if every row will change so we only need to consider circumstances where UPDATEs are being run which do NOT change any columns ona significant number of the rows being processed.

For me, in OLTP world, I can't think of scenarios where the additional tests (there will already be some criteria) in the WHERE clause would cause the update to be slower than doing the update, and I definitely don't want the side-effect of triggers firing (we have Audit tables on pretty much all the data tables, and triggers doing additional "belt and braces" data checks which take time). There are probably also some Foreign Key and other Constraint checking costs.

darlove · February 11, 2016, 4:04pm

As I said, everything depends on how many rows you have to update and how many columns there are that you have to change. And while you are right in your OLTP wold, you might not be so right in the OLAP one (which is where I work right now).

Let's agree to agree on this one

Best,

Darek

Kristen · February 11, 2016, 4:13pm

OK.

As I don't work in OLAP world an example of a "a many-row, many-column, update (that perhaps does not have a WHERE clause of some significance)" would help me get a better understanding of that scenario, Hadn't occurred to me in my original reply to your "Why wouldn't you update a column with the same piece of data? Where does it hurt?" as it didn't occur to me that your OLAP world might be different to my OLTP one!

I would guess that you probably don't have triggers, and are using SIMPLE recovery model, which would get rid of my two greatest issues. You might also not have as many Constraints (I always wonder how much Umph they steal, every time an insert/update occurs, in policing the system and whether I might enforce some integrity "elsewhere" in the system and dispense with the database also acting as policeman. Presumably in OLAP you have much greater opportunity to take that decision than in OLTP ... if we got Goofy Data in the system the effort to sort it out could be horrendous ...

darlove · February 11, 2016, 4:27pm

You're very much right, Kristen.

I don't have triggers, the model is SIMPLE and I can lock the table completely when I do the update. In a way this is necessary because you don't want lock escalation. Nothing should really interfere with the update.

And the update is also done in batches of 10,000 rows in a WHILE loop which is sooooooo much better than doing one update with 1 million rows in one go... One undeniable benefit of this is that if anything wrong happens, there is not much to roll back. It takes seconds to interrupt the process. With just a cluster and one unique NC index in place, I can currently update 10,000 rows with 163 columns within less than 1 second (and I don't use anything that would say: If the value here is the same, don't do it---I think it's almost impossible with this amount of columns and data).

All in all, OLTP and OLAP are very much different and different approaches are needed, indeed, to make things work fast and seemlessly.

Best,

Darek

ScottPletcher · February 11, 2016, 5:27pm

Why? The best clustered index is the most critical factor for best performance, and often enough that will be the PK.

gbritton · February 11, 2016, 6:50pm

I think the answers proves my point. If the update changes nothing, nothing will move. If the update changes a part of the key that changes the ordering of that entry, then of course it will move. How could it not? It's a B-tree after all

Kristen · February 11, 2016, 7:05pm

Misunderstanding perhaps? I was referring to updating columns that are keys in the Clustered Index

I would not, normally, update any column in the PKey.

Ergo if the Clustered Index is the same as the PKey I would not normally update the Clustered Index.

If the clustered index is chosen for a different reason (e.g. most suitable columns for the most common range query), and that is not the same as the PKey, then I see no reason not to update it if the need arises and also no reason to try to avoid updating it (understanding that it will cause more physical I/O than, say, updating a column that is associated with a non-clustered index)

Happy to hear of different views though.

darlove · February 12, 2016, 10:20am

gbritton, the question was not about moving things around but doing an in-place update vs. the kind of delete-insert update...

I know that if you change part of a cluster, even if you update a field's value to the same value, the position of the key should not change. That's obvious and stems from very basic theoretical considerations about B-trees.

But the question was different, namely, whether the value would change in-place or an insert-update would be needed. And it so happens that even if the value IS THE SAME, an insert-update would happen. The only one time when an in-place change is performed is when you update a text column from a value like 'table' to a value like 'cable'---a change that leaves the text being the same length.

Best,

Darek

gbritton · February 12, 2016, 2:18pm

But you were asking about a datetime[quote="darlove, post:1, topic:5149"]
Now, I do an update to the datetime column but the new value is the same as the old one
[/quote]column, not a text column. So there should be an in-place change.

In any case, best practice is to check that the new data differs from the old before updating it. Kristen showed just how important that can be.

darlove · February 15, 2016, 3:52pm

Hi gbritton.

If you update many rows at the same time (say 100) and the number of rows goes into hundreds of thousands (if not millions), the check if any of the columns differs from what you're going to insert would slow the update down tremendously. It would have a very adverse effect on the update. Therefore, I don't think it's a good idea to have checks like this.

By the way.. You didn't understand my previous case. An in-place update happens IF AND ONLY IF you change text in the way outlined in my post. Any other data item will cause the delete-insert update.

Best,
Darek

gbritton · February 16, 2016, 1:47am

why? SQL has to at least read those pages. The small CPU it takes to check for changes to avoid rewriting them is well worth it. CPU cycles are much cheaper than I/O.

Measure it if you're not sure.

Anyway, you were talking about changing dates, not text (unless I misunderstood your fist post).

darlove · February 16, 2016, 7:12pm

OK. There's no point arguing about text or dates. Obviously, when I have a big table, I'm going to have all kinds of data types in there unless I had a very, very special table. But I don't (thank God).

I'll try to do what you say: I'll measure it. That is the only way to know. But bear in mind that a check is not something simple like column1 <> column2 because you have to account for NULLs as well. If I have to impose 50 checks on an insert, then I think a simple insert could outperform... but we'll see.

Cheers!

Best,
Darek