Alternative to sp_updatestats

Kristen · May 13, 2017, 7:12am

I have a 3rd party script which is using sp_updatestats inside a loop (during a large data import and subsequent, somewhat massive!, updates of the data according to workflow rules etc).

I don't have a problem with that, but a) it's taking some time and b) its outputting huge amounts of

[_WA_Sys_MyColumn_05A3D694], update is not necessary...

wibble, which is polluting the output and making spotting any errors etc. tricky and c) the sp_updatestats runs even if the number of rows changed in that loop iteration is tiny, or zero even

SQL DOCs says "For disk-based tables, sp_updatestats updates only the statistics that require updating based on the modification_counter information in the sys.dm_db_stats_properties catalog view, thus avoiding unnecessary updates of statistics on unchanged rows."

Doe that mean that using sp_updatestats in this way is NOT doing anything that the routine dynamic update of statistics wouldn't also do? or is sp_updatestats using a lower threshold for the modification_counter perhaps?

Could I just rely on the system rebuilding stats dynamic when it feels the need? I'm sure the Vendor put this in there because of a bad experience at some time!

Could try to just use UPDATE STATISTICS on specific table(s), but in this scenario its pretty difficult to know which tables might have had massive numbers of rows inserted (and be future-proof in that regard) so I can see why the 3rd party took this "update everything" approach

One thought I had was to create a job that runs sp_updatestats and just have the script launch that job, so it runs in parallel (maybe there is another way to achieve that?) so that the output is not polluting the primary job.

Kristen · May 15, 2017, 4:51pm

Bump

jeffw8713 · May 15, 2017, 6:21pm

sp_updatestats just checks if the modification counter has a value greater than zero - it does not look at how many rows have been updated. It also uses a sampling rate by default - unless you utilize the 'RESAMPLE' option which will use the previous sampling rate that was used.

Auto update stats also uses a sampling rate - but it only kicks in when a certain threshold has been reached. This sampling rate may or may not be good enough depending on the distribution of data in the column.

My recommendation would be to build your own code - and determine what thresholds are valid and whether or not a sampling rate for that table is valid for your system. That is the only way to be sure that you are updating the stats as needed for your application.

Kristen · May 15, 2017, 6:41pm

That's a good idea - which I should have thought of!!. At the least I could just do what sp_updatestats does EXCEPT outputting a million rows of "Here's the Name of a Statistic for internal use only and not on a real index, which has NOT been updated coz nothing's changed" messages!

sp_helptext 'sp_updatestats'

shows its pretty straightforward, and only 173 lines long

In fact if I was feeling pretty brave I could just put it back in MASTER with the

raiserror(15653, -1, -1, @ind_name_quoted)

line commented out. Running this

raiserror(15653, -1, -1, 'XXX')

outputs

    XXX, update is not necessary...

Always bothers me when I look at System SProcs how "amateurish" the code looks. Things like this look ripe for improvement to me:

	if (len(@options) > 0) select @options = @options + ', NORECOMPUTE'
	else select @options = 'NORECOMPUTE'

Kristen · May 16, 2017, 1:30pm

Wasn't that straightforward. sp_updatestats uses a system function (stats_ver_current) which can only be run from within a system Sproc ... so its a bit more work that just "comment out the useless RAISEERROR message")

jeffw8713 · May 16, 2017, 9:23pm

I was going to mention that...I wrote a custom procedure to do this which can be very complex depending on the rules you want to implement. Here is a query to start with:

 Select ss.[object_id]
      , ss.stats_id
      , object_schema_name(ss.[object_id]) As SchemaName
      , object_name(ss.[object_id]) As ObjectName
      , ss.name As StatsName
      , Case When ss.auto_created = 0 And ss.user_created = 0 Then 'Fullscan'
             When ss.user_created = 1 Then 'Default'
             When ss.auto_created = 1 Then 'Default'
             Else 'Default'
         End As SamplingRate
      , sp.last_updated
   From sys.stats                                                       ss
  Inner Join sys.objects                                                so On so.[object_id] = ss.[object_id]
   Left Join sys.indexes                                                 i On i.[object_id] = ss.[object_id]
                                                                          And i.index_id = ss.stats_id
  Cross Apply sys.dm_db_stats_properties(ss.[object_id], ss.stats_id)   sp
  Where sp.[rows] > 0
    And so.[type] In ('U', 'IT')
    And sp.modification_counter > 0
    And (i.is_hypothetical = 0 Or i.is_hypothetical Is Null)
  Order By
        Case When ss.auto_created = 0 And ss.user_created = 0 And sp.[rows] <> sp.rows_sampled Then 1
             When ss.user_created = 1 And sp.[rows] <= @lowerLimit And sp.[rows] <> sp.rows_sampled Then 2
             When (ss.auto_created = 0 Or ss.user_created = 1) 
              And sp.modification_counter > sqrt(1000 * sp.[rows]) Then 3
             When (ss.auto_created = 0 Or ss.user_created = 1)
              And sp.modification_counter + (sp.modification_counter / (datediff(hour, sp.last_updated, getdate())) * 24) > sqrt(1000 * sp.[rows]) Then 4
             When ss.auto_created = 0 And ss.user_created = 0 Then 5
             When ss.user_created = 1 Then 6
             Else 7
         End
      , cast(sp.last_updated As date)
      , sp.modification_counter desc
      , ss.stats_id;

You can add logic to set norecompute based on your specific requirements - additionally you can specify the sampling rate or any other values based upon your requirements.

The order by statement is where the real work is done...we are attempting to order the results so the statistics that really need to be updated are at the top...this way we can place a limit on how long or what times this runs.

The idea is that any statistics that support an index where the stats were previously updated using a sampling rate are selected first, then we select user created that are less than a limit of rows with a sampling rate, then user created that would be selected based on the modification count, etc...

Of course - that can all be adjusted based on what works for your systems. This is used on my larger systems (3TB and 8TB) - and we never get through the whole list over 8 to 10 hours or processing.

Kristen · May 17, 2017, 5:59am

Awesome, thanks

Kristen · May 20, 2017, 10:02am

I cheated in the short term ... and set up a SQL Agent Job to run sp_UpdateStats on the particular database in question ...

... that died because the script launched the job whilst it was still running from previous Loop (and yet if an actual job is launched, and still running, SQL Agent takes care of that ... I would have thought that sp_Start_Job would have done, or had a @ParameterFlag, to do the same. Ho!Hum!)

So I wrote a wrapper for sp_Start_Job to NOT start the job if it was already running.

As I'm sure you know detecting that a Job is still running is NOT straightforward either ... there is the issue that a job that was running when the server crashed / rebooted / had a power failure does not get marked as Completed, so the "Is Job Running?" query is "complex"

Daft that loads of people are either "falling into that trap" or "writing code to work around it" when MS could just improve the system procedures a small bit and make everyone's life easier.