What housekeeping / maintenance scripts do you run?

Kristen · September 7, 2015, 2:07pm

I'm revamping our housekeeping & maintenance scripts. I would be interested to hear what data you record for information-purposes, or variations on housekeeping that you perform.

We have an "Admin" database which looks after this type of thing. That contains all informational data that we record (e.g. database disk file size over time, and when databases were backed up etc), and it also contains Sprocs for the actual database housekeeping & maintenance.

Our Admin Database contains:

Backups

Table of database names and the "category" of backup that they need. Any newly created database is automatically added to the table with a "Normal" backup schedule.

For us:

High priority = TLog backups every 10 minutes, DIFF at midnight, FULL on Sunday night
Normal priority = same but TLog backup is once per hour

We have an Sproc that makes the database backups, and we EXEC that Sproc whenever we want to take a "manual" backup (e.g. before a risky procedure!). Parameters are: database name (default = current DB), Backup Type (TLog, Diff, Full) and an optional Comment. We don't allow any other means of taking a backup which ensures that the backup files are all in the "normal" folder.

Index Rebuilds and Update Statistics

We have an SProc with a parameter for the name of the Database. Various other parameters available to control the threshold for Fragmentation level etc., but I don't think we ever fiddle with them. This will basically rebuild all indexes that are fragmented in all databases (actually for any index which is not tiny it uses REORGANISE rather than REBUILD). Like the backup we have a High priority and Normal priority and any new database is automatically added at the Normal priority. Normal only runs at weekends, High priority runs every night. We record, historically, how fragmented the indexes were (and whether they were rebuilt, or not)

We don't, currently, have any means of detecting when Statistics are out of date, so when the Index Rebuild SProc runs it Updates Statistics on ALL tables (even if the index was not rebuilt because it was not fragmented). I'd like to improve that and would appreciate any ideas on thresholds for trigger an Update Statistics on a particular table/index.

Information Gathering - collected once Daily

Server Config - gather, and store "historically", various Registry and ServerProperty settings.

File Info - ditto for physical database files. We store the path/filename although, for us, this should never be non-standard, but would be handy for a post mortem if the DB moved somewhere strange! More importantly, for us, is the size of the files and thus the growth-rate-over-time. Always nice to give management early warning of a purchase of more disk space / a new server

Various rows from sys.dm_os_wait_stats
Various rows from sys.dm_db_index_usage_stats

Information Gathering - collected every Ten Minutes

Various rows from sys.dm_os_performance_counters (for some counters a measurement Delta is calculated, and stored, from two readings taken 10 seconds apart)

A Purge routine deletes old/stale historical data (mostly that is anything older than 3 months)

waterduck · September 8, 2015, 2:25am

do you mind to include data retention period, data compression for old data, closed column-store state?

Kristen · September 8, 2015, 6:57am

Thanks WD.

Our backup routine has a Retention Period. We keep full backup for 4 weeks, diffs for a week, and TLogs until the DIFF they refer to is deleted. When the Backup Sproc runs it deletes any file with retention past-due.

We use [SQL's native backup] data compression for all backups (I raised a thread on here asking if there was any reason NOT to compress, and the consensus was that there was no good reason not to )

That's in terms of the actual Backups ...

For Data we have PURGE Sprocs, which are application-specific (and, as such, I see them as being APP related rather than Housekeeping/DBA related, but ... its a grey area!). Those SProcs delete old / stale data. Typically this is data in our Audit tables (we have an audit table on almost every data table, we use triggers to store DELETED - so we do NOT store current record version in audit table, which saves disk space). When we purge audit table we retain one row (for each PKey value) which is older than the threshold [so we can say "Audit contains, say, 3 months data and the next oldest change was NNN months/years before that" which users sometimes find helpful - particularly if it was a long time previously as they then know that the record could not have been changed in the period of interest. But other than that we don't compress old data - I'd be interested to hear what other folk do in that regard??

All our APPs are OLTP so we don't have an column-store indexes (or perhaps we should have? but my understanding is that INSERT/etc. not possible on a table with active column-store index?)

Kristen · September 8, 2015, 10:32am

I am now wondering if I should store most/all columns from sys.indexes in my Fragmentation History table?

Two reasons:

[fill_factor] is pretty important in deciding if a fragmentation percent is "high", or not.
over time I'd like to know if anything has twiddled with any of the index properties. This may be more relevant to 3rd party APPs than my own ...

which then leads to whethre I can be bothered to create a Delta-Changes table so I only store the sys.indexescolumn values when they change, rather than duplicating them in every row in my Fragmentation History table ... and then I have all the added hassle of JOINing my Fragmentation History table to the RIGHT version of the sys.indexes history table <sigh!>

Unless someone has a smart means of doing the "join to row that was in place at that time" ?

Along the same lines I'm thinking I should not store Database Name, Table Name, Index Name in every row, but I ought to have lookup-tables for those. <Sigh2!!>

djj55 · September 8, 2015, 12:37pm

How often do you try to restore a database from backup? This being an important part of any DR plan.

stephen_hendricks · September 10, 2015, 6:22pm

One thing that we have done is to have a "Master" administrative job launcher whose role is to invoke a series of maintenance jobs. Each administrative task is defined within its own SQL Job but has no schedule; the master job is scheduled and invokes each sub-job. This has a few benefits for us:

Resource contention is minimized; One task that locks a table won't cause another task to fail with a deadlock
Resource utilization is optimized since once one task is completed, the next task gets invoked; no guessing how long any task might take and scheduling the others accordingly.
If one task fails, its job can be individually invoked once the issue is resolved. Compared to a SQL Job with multiple steps where if you want to re-run step #2, for instance, the SQL Job will automatically re-run all of the subsequent steps which you may not want/need.

stephen_hendricks · September 10, 2015, 6:37pm

To answer the original question: What housekeeping do you perform

Consistency checks (DBCC)
Backups
Data Gathering: Disk utilization/consumption rate
Data Gathering: CPU utilization over time
Data Purge: Email
Data Purge: Disk, CPU, etc historical data after XXX days (See above)
Data Gathering: Performance Objects (Stored procedures by CPU, IO and Elapsed time)
Rebuild indexes/Update statistics (Ola Hallstrom)
Disk Partitioning: Slide window
Data aggregation in support of reporting

Kristen · September 11, 2015, 10:43am

Surprised no one has mentioned that - I realised I had omitted that from my original post and thought I'd wait and see if anyone noticed Good to hear I'm not typing to myself after all!

The other one is clearing down the history in MSDB (for which MS provide the SProc "msdb.dbo.sp_delete_backuphistory", which historically had absolutely dreadful performance if it was not run regularly and suddenly had to deal with a huge backlog of stale data to purge). You probably have that covered in your Data Purge

We have a sort-of-similar thing in that we have a Master Task scheduled for each database (actually a single SQL Agent Job with one STEP per database, with ON FAIL goto next step). The APP database has a Table with Sprocs to call. It calls each one in turn - that takes care of daily purging of stale data, the occasional "Batch processing" task for clearing down month end figures and the like. (Like SQL Agent Jobs our Tasks table has an "Anniversary date/time" and Year/Month/Day/Hour/Minute offsets to calculate the next run-date. Any task with a past-due Next Run On date/time is executed when the Master Task runs)

However, the nose-to-tail sequential nature of the process mean that we find that some tasks run very late - because earlier tasks take longer than expected. We run our FULL Backups using this method, so they run sequentially, but it can be an hour before the last database backup starts. (Poor example, because that task is at least consistent day-to-day so I can react to it if it becomes a problem, its more of a problem where a task takes an unexpectedly long time and the delay of subsequent tasks becomes critical).

We have now introduced three scheduled SQL Agent Jobs - We have one for Daily / overnight tasks, they can take hours & hours to run ... the other two, Regular and Frequent, allow tasks/SProcs to be assigned to each group according to how much "impact" they are likely to have. Individual Sprocs are still only triggered once their Next Run On date is past due, but we have less chance of a long-running task delaying the running of others in a way that upsets the users.

Typing this it now occurs to me that I should introduce a "Warn if takes longer than XXX".

I suppose I'm used to how-we-do-it ... but I like the fact that we have relatively few tasks in SQL Agent Jobs list. With a handful of 3rd Party APPs installed, all with different ideas about how to use SQL Agent Jobs (some call an Sproc, some have yards and yards of code in the Job) I have come to prefer moving the guts of the jobs out of SQL Agent into the APP itself.

One other problem we have with this is if a Child SProc fails with an error <= Level 16 - which terminates the connection, and thus all subsequent operations (because we are using SET XACT_ABORT ON in the Child SProc).

Hadn't thought of that as being a benefit, but we have Enable/Disable on individual rows in our Task Table, which achieves the same thing. As it happens I'm just about to disable an SProc task that is failing - its something I've been working on and clearly screwed up! its only doing some minor housekeeping so if I don't fix it for a week or two it won't matter ...

Talking of SQL Agent Jobs: the default is "Limit Size of Job History" based on "Maximum job history rows" and "Max history rows per job". Whilst I'm pleased that there is SOME history purging, out-of-the-box (unlike MSDB ) I don't think the choices are good ones. I have a job with one-step-per-database (this triggers our Daily Task) and it stores so little history as to be useless because the second limit ... and we also have a once-a-minute task which in just one day exceeds the 1,000 which triggers the first limit far too soon, so that causes deletion of all other job's history. I change the Job History Purge setting to "Remove job history after 4 weeks" [SSMS : SQL Server Agent :Properties : History]

Kristen · September 13, 2015, 11:50am

Good job this came up in this thread as it caused me to do some checking ...

If you deselect "Limit size of history log by rows / rows per job" and instead select "Remove Agent History older than N weeks / months" then it does NO cleanup and carries on history-logging forever.

This was a known bug in MS Connect which was logged in 2009, closed my MS in 2009 as "Known issue which will be fixed in the next major release" and is STILL a bug in SQL2014

Solution is to schedule msdb.dbo.sp_purge_jobhistory

DECLARE @MyCuttofDate datetime = DATEADD(Week, -4, GetDate())	-- Retain 4 weeks
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @MyCuttofDate

and whilst I'm at it I will ensure that someone does not re-activate the "Limit size of history log by rows / rows per job"

EXEC msdb.dbo.sp_set_sqlagent_properties
	@jobhistory_max_rows = -1
	, @jobhistory_max_rows_per_job = -1

Kristen · September 13, 2015, 12:44pm

I had a look at msdb.dbo.sp_purge_jobhistory

The DOCs say (as of "SQL Server 2014, SQL Server 2016 Preview")

sp_purge_jobhistory 
   {   [ @job_name = ] 'job_name' | 
     | [ @job_id = ] job_id }
   [ , [ @oldest_date = ] oldest_date ]

and "Either job_id or job_name must be specified, but both cannot be specified"

But looking at the code this is clearly not the case - just an @oldest_date parameter is sufficient. here's the code - viewed using:

EXEC msdb.dbo.sp_helptext sp_purge_jobhistory

IF ((@job_name IS NOT NULL) OR (@job_id IS NOT NULL))
BEGIN
...
END
ELSE
BEGIN
	-- Only a sysadmin or SQLAgentOperatorRole can do this
	... fair enough ... there is a test for this with suitable RAISERROR() ...

	IF(@oldest_date IS NOT NULL)
	BEGIN
		DELETE FROM msdb.dbo.sysjobhistory
		WHERE ((run_date < @datepart) OR 
			(run_date <= @datepart AND run_time < @timepart))
	END
	ELSE
		... delete the lot ...

At the start of the Sproc the @oldest_date parameter is split into a pair of INTs @datepart & @timepart for the stupid [run_date] / [run_time] format columns that sysjobhistory uses)

AND ... what a rubbish bit of code is that?

		WHERE ((run_date < @datepart) OR 
			(run_date <= @datepart AND run_time < @timepart))

I looked at it and thought "surely if I pass a @oldest_date which includes a TIME componet then that second clause is going to retain any oldder date that had a LATER time in the day!!!

On closer checking the first clause will actual delete the earlier date, period. But surely?? the Summer Intern meant to write

( run_date = @datepart AND run_time < @timepart))