SQLTeam.com | Weblogs | Forums

How to check (triple-check?!) that Backups are being made?


#1

How do you check that scheduled backups are indeed getting made?

A recent failure in a scheduled process recently left us with no backups on some databases and I would like to safeguard against that.

My thought was to have a "report" that checks the most recent backup date (in MSDB Backup History) and alert if any database has not been backed up "recently".

I have been very reluctant to get an email from every server just saying "All OK" because I'm not sure that anyone will spot if one-out-of-many emails is missing. Perhaps there is some way for each server to send some data to a central system and that system to detect, by exception, if any server has not sent in its report.

But maybe there is a better, simpler, way?


#2

We have something that runs on each server and sends an email if it spots an out-of-date backup. We also have something else that checks these jobs are running OK.

That way you don't get loads of "All OK" mails, which hopefully means you don't miss the ones that say "Please panic now".


#3

I'd be happy with that, but how do you guarantee that you get the email? That will be the one time that the SMTP is down or whatever ...

I've been toying with the idea of getting the server to send some data every hour to a central server. I have no idea, yet, how it would "send" that data ... all ideas welcome!

I might be able to PULL the data - we have permissions etc. to connect to all the Client servers, not sure I want to use Linked Servers, but i could have a BATch file that used SQLCMD or similar to query each server in turn.

The Absence of any data would be deemed to be a failure ...

This is the code that I have got so far (to check for missing backups). I reckon I also need a script for Failed SQL Agent jobs

-- Missing backups - List Databases with no recent backup
DECLARE	@dtFull		datetime = DateAdd(Day, -8, GetDate())
	, @dtDiff	datetime = DateAdd(Day, -2, GetDate())
	, @dtTLog	datetime = DateAdd(Hour, -2, GetDate())	-- Skip if SIMPLE Recovery Model
--
SELECT	FullDt
	, DiffDt
	, TLogDt
	, OtherTypes
	, D.is_read_only
	, D.recovery_model_desc
	, T_ErrMsg
	, D.name
FROM	master.sys.databases AS D
	LEFT OUTER JOIN
	(
		SELECT	database_name
			, [FullDt] = NullIf(MAX(CASE WHEN type = 'D' THEN backup_finish_date ELSE '19000101' END), '19000101')
			, [DiffDt] = NullIf(MAX(CASE WHEN type = 'I' THEN backup_finish_date ELSE '19000101' END), '19000101')
			, [TLogDt] = NullIf(MAX(CASE WHEN type = 'L' THEN backup_finish_date ELSE '19000101' END), '19000101')
--		BS.type = 'F' for File Group backups
			, [OtherTypes] = SUM(CASE WHEN type IN ('D', 'I', 'L') THEN 0 ELSE 1 END)
		FROM	msdb.dbo.backupset AS BS
		WHERE	backup_finish_date <= GetDate()	-- Do not include any future-date rows
		GROUP BY database_name
	) AS BS
		 ON BS.database_name = D.name
	CROSS APPLY
	(
		SELECT	[T_ErrMsg] = CASE WHEN FullDt > @dtFull THEN '' ELSE 'Stale FULL backup. ' END
				+ CASE WHEN DiffDt > @dtDiff THEN '' ELSE 'Stale DIFF backup. ' END
				+ CASE WHEN TLogDt > @dtTLog OR D.recovery_model_desc IN ('SIMPLE') THEN '' ELSE 'Stale LOG backup. ' END
				+ CASE WHEN OtherTypes = 0 THEN '' ELSE 'Invalid TYPE. ' END
	) AS X
WHERE	1=1
	AND D.name NOT LIKE 'RESTORE[_]%'	-- Ignore any dataasbe named "RESTORE_xxx"
	AND D.name NOT IN ('tempdb')
--	AND D.state = 0				-- Only if database is online
	AND NullIf(T_ErrMsg, '') IS NOT NULL	-- Only list Databases with errors
--
ORDER BY 
	CASE WHEN T_ErrMsg = '' THEN 2 ELSE 1 END,
	D.name

#4

If you're planning on use differentials and logs, don't you also need to skip/ignore any "COPY_ONLY" backup(s)?


#5

Although it's not a bad thing to start with, I wouldn't rely on anything from MSDB. I'd go to the files themselves and make sure that they had a correct header and at least passed a RESTORE with validate only. To check for any missing files in the transaction log chain, you might also want to compare LSN's from the file headers and make sure they've got you and the full/diff backups covered correctly.


#6

Good shout, thanks.

Really? That's a bummer (extra effort to check files etc. rather than rows in MSDB). If MSDB is missing a row, but there is actually a file, I'd get a false positive, which I can live with. Is it really likely that a backup fails to be made but the row exists in MSDB (i.e. a row with a date in [backup_finish_date])?

Ah ... the upside. I was thinking about LSN's and checked the rows in MSDB and thought "Maybe", but it so happens that the Client's IT folk have been doing direct backups of SQL (and NOT using a COPY ONLY method), including doing TLOg backups once-a-day (only) ... which was breaking the backup chain. Of course those backups ARE in MSDB so it looks like the chain is OK if I only check that ...

... so ... RATS! ... sounds like I do need to check the physical files. I suppose that would also bring to light if some Well meaning (I hate those people ... :slight_smile: ) person deletes some "old" files to make some disk space / whatever.

OK, "Must try harder" ... thanks folks, I'll report back - hopefully with some useful code. The stuff I fell over in Google was dreadful - just checking

backup_finish_date > GetDate()-2

or similar ... its been my experience that it is easy to get a clock error and then you have a future-dated row in the DB, so at the least the test needs to be

backup_finish_date > GetDate()-2
AND backup_finish_date <= GetDate()

#7

P.S. If I PULL data from each server I will get the chance to also check what variation I get on "now" date/time and I can also react to a clock being wrong


#8

Exactly. You could certainly check the file paths in MSDB to find out if someone else was making backups some place else but it won't tell you a thing about files being deleted because someone wanted to make some space or whatever. MSDB also won't tell you if someone put some other files on the backup drive because there was space available. There's nothing like have a weekly full backup suddenly start failing because there's no disk space left and whatever system someone wrote for the backups has written it so it does continue with other backups and doesn't provide proper notification of the failure.

There's also nothing like, as you say, counting on the fact that the last two weekly full backups, related DIFs, and related Log File backups are there according to MSDB only to find out that someone wrote something or made a change to delete anything and everything older than 7 days because of space crunches.