Only create backup if DBCC CHECKDB is OK

Hi, I have a daily backup job that runs as a job in the SQL Agent. I would like to change it so that it first checks whether the database passes the DBCC CHECKDB([mydatabasename]) check; if it does, it can proceed with the backup, but not if it returns errors.

If I run that DBCC CHECKDB([mydatabasename]) as a job step, will the step fail if it reports an error?

Error checking of DBCC CHECKDB is not entirely straightforward :frowning:

You can use it in a SQL Agent Maintenance Plan, I believe, but I'm not familiar with that.

DECLARE @name sysname = N'mydatabasename'
DECLARE @sql nvarchar(max)

SET @sql = 'DBCC CHECKDB (' + QuoteName(@name) + ') WITH EXTENDED_LOGICAL_CHECKS, DATA_PURITY'
EXEC sp_executesql @sql
IF @@ERROR <> 0
BEGIN
    -- ... Error Handler ... e.g. raise an error so the Agent job step fails
    RAISERROR('DBCC CHECKDB failed for %s', 16, 1, @name)
    RETURN
END
-- ... Continue with backup ...

The options I use are EXTENDED_LOGICAL_CHECKS and DATA_PURITY

But ... that said ... I personally would NOT stop backups because CHECKDB failed. The failure might be minor (and repairable), and it might go unnoticed for some time - and then you have no backups, which might be critical in a number of ways. If you stop backing up the LOG (for FULL Recovery Model) it will grow continuously, might fill the disk, and would then need shrinking - which we try to avoid at all times.

Also, CHECKDB can be slow, and could become a scalability problem, so we prefer to do CHECKDB on a RESTORED copy of the database (on a non-critical server). That also proves that the database backup is actually restorable :slight_smile:
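Something like this, as a rough sketch - the _check database name, file paths, and logical file names are just placeholders for illustration (NO_INFOMSGS added so only errors are reported):

RESTORE DATABASE [mydatabasename_check]
FROM DISK = N'D:\Backups\mydatabasename.bak'
WITH MOVE N'mydatabasename'     TO N'D:\Data\mydatabasename_check.mdf',
     MOVE N'mydatabasename_log' TO N'D:\Data\mydatabasename_check.ldf',
     RECOVERY

DBCC CHECKDB ([mydatabasename_check]) WITH EXTENDED_LOGICAL_CHECKS, DATA_PURITY, NO_INFOMSGS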

Some other things I would look to, in terms of early-error-warning:

Set all databases to Page Verification = CHECKSUM rather than TORN_PAGE_DETECTION or, worse!!, NONE. (You then need to rebuild all indexes to "apply" the Page Verification, because only newly written pages will get a Checksum.)
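For example (using the database name from this thread; the second statement just confirms the setting across all databases):

ALTER DATABASE [mydatabasename] SET PAGE_VERIFY CHECKSUM

SELECT name, page_verify_option_desc
FROM sys.databases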

Set backups to "CONTINUE_AFTER_ERROR " - better, to my mind, to have something that might be rescuable rather than nothing.

Set backups to use WITH CHECKSUM (that both checks that the Page Verification is correct and generates a checksum for the entire backup)
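Combining that with CONTINUE_AFTER_ERROR from above, a backup statement would look something like this (the path is just a placeholder):

BACKUP DATABASE [mydatabasename]
TO DISK = N'D:\Backups\mydatabasename.bak'
WITH CHECKSUM, CONTINUE_AFTER_ERROR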

I would also recommend that you turn on Compression of backups - there is a single, global flag you can use to change the default. It reduces I/O on both Backup and Restore, and of course file size / transmission time over LAN/WAN etc.
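The global flag is the 'backup compression default' server option:

EXEC sp_configure 'backup compression default', 1
RECONFIGURE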

I changed it so that a DBCC CHECKDB is made first; if that fails, an email is sent out notifying the DB admin. If the DBCC CHECKDB succeeds, the database backup will run. This job is run every day. The reason I stop making the backup is that I don't want to back up a possibly corrupt database - instead I want to be notified and hopefully correct the issue if possible. Does that sound OK?

It sounds fine, but personally I would still make the backup for the reasons I have mentioned.

Presumably you are backing up to uniquely named files, and therefore not overwriting any existing backup files? If so, I don't see the additional (possibly corrupted) backup file being a problem.

Currently I overwrite the existing backup file, therefore I don't want to overwrite the latest working copy I have.

By the way, I created a job that creates backup filenames based on the date. But how do I configure SQL Server to only keep 5 backup copies? I only want the 5 latest days' backup files to be saved on disk...

Sorry, but my advice is that you ABSOLUTELY should not do that. Keep a history of files online, by giving them unique names (e.g. including Date & Time). If anything goes wrong with your backup-to-same-file method you have nothing at all to restore back to. Maybe the file will be on tape ... but if they all have the same name that can lead to confusion when restoring (compared to a filename that includes the database name plus the Date/Time of the backup).

If you are using the Maintenance Plans I believe they can delete "stale" files after a given date.

If you are building your own solution then a history of backups is available in the MSDB database. You could use that as the basis for working out which filenames are old, and deleting them. You'd need to "remember" what date you last deleted up to.
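A sketch of such a query, assuming the 5-day retention mentioned above (msdb.dbo.backupset and backupmediafamily are the standard history tables):

SELECT bmf.physical_device_name, bs.backup_finish_date
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.database_name = N'mydatabasename'
  AND bs.type = 'D'   -- full database backups
  AND bs.backup_finish_date < DATEADD(DAY, -5, GETDATE())
ORDER BY bs.backup_finish_date
-- each physical_device_name returned is a candidate file to delete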

Or you could use a 3rd party SQL backup system. I favour Minion Backup because it is easy to configure - just installing it will provide a backup plan that is probably "good enough" for most people, but beyond that you can configure it further.

P.S. If you are using FULL Recovery Model I recommend that you backup the Log file every 10 minutes, or more often. Don't leave it longer - there is no benefit, but great chance of data loss.
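The statement itself is trivial - e.g. something like this, scheduled as a 10-minute Agent job (the path and timestamped name are placeholders):

BACKUP LOG [mydatabasename]
TO DISK = N'D:\Backups\mydatabasename_20240101_1010.trn'
WITH CHECKSUM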

If you have changing data which is difficult (or impossible!) to recreate - e.g. people typing in details from phone calls and emails, rather than from formal paper forms which are then kept - I recommend that you do set your database to Full Recovery model, because it a) allows restore to a point in time (e.g. when a hardware fault occurred, or some customers were deleted accidentally) and b) often allows a database to be recovered with ZERO data loss following a hardware fault that corrupts the MDF data file - more particularly if the LDF log file is stored on a separate disk spindle and a separate disk controller, but often even when that is not the case.
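A sketch of what that zero-data-loss recovery looks like (file names and the STOPAT time are placeholders; NO_TRUNCATE lets you take the tail-log backup even when the data file is damaged):

BACKUP LOG [mydatabasename]
TO DISK = N'D:\Backups\mydatabasename_tail.trn'
WITH NO_TRUNCATE

RESTORE DATABASE [mydatabasename]
FROM DISK = N'D:\Backups\mydatabasename_full.bak'
WITH NORECOVERY, REPLACE

RESTORE LOG [mydatabasename]
FROM DISK = N'D:\Backups\mydatabasename_tail.trn'
WITH STOPAT = '2024-01-01T10:00:00', RECOVERY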

I disagree - this is a very dangerous approach. If you continue backing up a corrupt database - then very soon you will no longer have a known good backup available on disk to recover from...

You never stop transaction log backups due to integrity checks - because these are what you will use to recover from any corruption found. You need that known good backup and all transaction log backups up to the current point in time.

If you are offloading the integrity checks to another system due to size then you need to ensure that someone is monitoring that process on a daily basis and verifying the success of the integrity checks. If you ever have a failure you need to immediately follow up with an integrity check in production - and start the process of recovering the system.

I would never count on any type of corruption being easily repairable - in almost every case corruption is going to require a restore from backups and transaction log backups to recover with no data loss. Using any of the DBCC CHECKDB repair options (REPAIR_ALLOW_DATA_LOSS in particular) will cause data loss and should be avoided unless there is absolutely no other option (e.g. you no longer have a known good backup to recover from).

These options are on by default - unless you upgraded from a 2005 system and never ran DATA_PURITY on the upgraded database. Any database created on a 2008 or higher system already performs the above checks unless you explicitly turn them off...
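If you want to verify that on an upgraded database, a commonly cited check is this one - but note DBCC DBINFO is undocumented, so treat it as an assumption:

DBCC DBINFO ('mydatabasename') WITH TABLERESULTS
-- look for the row where Field = 'dbi_dbccFlags':
-- a value of 2 indicates DATA_PURITY checks are enabled for this database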

I would not recommend doing this either - it could cause you to bypass IO issues, storage issues, or other issues with your backups, without having any idea that your backups have been failing. If you have been having issues and then find out your database is corrupt - when you go to restore your backup and it fails, how are you going to restore the system?

Any issues that occur with maintenance of your database system need to be addressed immediately. Doing anything to bypass the errors is just asking for more trouble - up to and including the loss of the business. You have to make sure that any issue at all is immediately followed up on and corrected.

The recommendations for page verify and backups to use CHECKSUM are extremely valuable. If you can setup an AlwaysOn Availability Group (or Basic Availability Group) - then a feature of the mirroring solution allows for automatic page repair (Automatic page repair for availability groups & database mirroring - SQL Server Always On | Microsoft Learn). This is not a replacement for integrity checks but could save you if there are any issues with your IO subsystem.

The goal of your backup solution is to make sure you can recover the system - up to a point in time. If at any point you don't have a known good backup you no longer have the ability to recover the system. It does no good having a backup that cannot be restored or requires the restore to continue after error.

The most important part is that as soon as there is an issue it needs to be addressed. Letting any issue go for any length of time makes it harder to recover with no data loss.

In such a situation, would the Admin be any more likely (or, indeed, less likely) to know that backups are not happening because CHECKDB detected a problem than to know that the backup was simply not being made at all?

If there is any reporting of the error (which of course there should be) then that is enough to act on it. I don't think stopping backups, while the corruption may be going from bad to worse, is a safe strategy - nor any more likely to be the catalyst for action.

If the DB is in Full Recovery, and the TLog backups are clean, then there is the ability to have a zero-data-loss restore. If not, then as I see it there are three choices:

  1. Restore from last-clean-backup, and repeat the work since.
  2. Restore from the most recent corrupted backup with tolerable damage
  3. Carry on - i.e. the damage-repair of current database is tolerable.

I'm not planning on ever being in the situation of having to make any of those choices, but if push came to shove I would like to have option 2 as well.

I agree. Yet we get a fair number of questions on this forum along the lines of "I have to repair my database" - usually the backups are non-existent (why? :frowning: ) or too far out of date (Why x2?)

Thanks, I didn't know that. (My databases have probably all migrated from pre-2005, although I will have run DATA_PURITY at the point of migration)

I should have said that I was assuming that the error would be noticed. I still take the view that if errors are not being noticed then it is not likely that the absence of a backup file will be noticed either!

My biggest worry always is whether anything that can go wrong will adequately announce the fact. Over the years we have "improved" such notifications, and each time I have thanked my lucky stars that we hadn't had a breakdown that would have fallen through that crack.

I have now made some changes to the backup routine; the job looks like this...

  1. Check data with DBCC CHECKDB
  2. If success proceed with backup
  3. If failure alert db admin with email
  4. Backup files have date and time stamp.
  5. This job runs every day.

There is also a script, run through the Task Scheduler, that deletes all .bak files older than 5 days, so that I only keep a 5-day history.
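For reference, a rough sketch of the job step in T-SQL - the Database Mail profile name and recipient address are placeholders, and it follows the @@ERROR pattern from earlier in the thread:

DECLARE @file nvarchar(260) =
    N'D:\Backups\mydatabasename_'
    + CONVERT(nvarchar(8), GETDATE(), 112) + N'_'                 -- yyyymmdd
    + REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), ':', N'')    -- hhmmss
    + N'.bak'

DBCC CHECKDB ([mydatabasename]) WITH NO_INFOMSGS
IF @@ERROR <> 0
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA Mail',           -- assumed mail profile
        @recipients   = N'dba@example.com',    -- assumed address
        @subject      = N'DBCC CHECKDB failed - backup skipped',
        @body         = N'DBCC CHECKDB reported errors; see the SQL Server error log.'
    RETURN
END

BACKUP DATABASE [mydatabasename] TO DISK = @file WITH CHECKSUM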

Looks good :slight_smile:

Don't want to labour the point, but if you have hard-to-reproduce data entry then FULL Recovery model and Log Backups every 10 minutes would be advisable.

I would also suggest a fall-back plan in case the emails fail (the address changes in several years' time and no one remembers to update it, or the SMTP server fails / is unavailable at the same time).

We have a separate process that queries MSDB to check the most recent backup date of all databases is "reasonable".
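For example, something along these lines (the one-day threshold is arbitrary - pick whatever "reasonable" means for your schedule):

SELECT d.name, MAX(bs.backup_finish_date) AS last_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name
HAVING MAX(bs.backup_finish_date) IS NULL
    OR MAX(bs.backup_finish_date) < DATEADD(DAY, -1, GETDATE())
-- any row returned is a database without a recent backup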

A trial-restore of the backup files (automated is fine) is another good safety-test.

You can of course overdo the paranoia, but my experience tells me that these things fail in unexpected ways :frowning: