Verifying Backups

Does anyone out there verify their backups?

We're thinking about doing this, but I'm questioning whether it's necessary. I've never had a SQL Server restore fail (except for obvious things like insufficient disk space, going to a higher version, a folder that doesn't exist, etc.).

At my last job I restored them every day, not just with VERIFYONLY. The peace of mind is worth it.
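For anyone weighing the two approaches, here's a minimal T-SQL sketch of the difference; the database name, logical file names, and paths below are made-up placeholders, not anyone's actual setup.

```sql
-- VERIFYONLY only checks that the backup header is readable and, with
-- CHECKSUM, that the page checksums stored in the backup are valid.
-- It does not prove the database will actually restore.
RESTORE VERIFYONLY
FROM DISK = N'\\BackupShare\Sales\Sales_Full.bak'
WITH CHECKSUM;

-- The only real proof is a full restore, e.g. to a scratch database name.
RESTORE DATABASE [Sales_RestoreTest]
FROM DISK = N'\\BackupShare\Sales\Sales_Full.bak'
WITH MOVE N'Sales'     TO N'D:\RestoreTest\Sales_RestoreTest.mdf',
     MOVE N'Sales_log' TO N'L:\RestoreTest\Sales_RestoreTest.ldf',
     CHECKSUM,
     STATS = 10;
```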


Most definitely, unless you want to find out one day that you have a disaster and your backups are empty or corrupt.

Before I started working as a DBA, I worked for a company that found out the hard way that their backups hadn't worked correctly for a month.

It took 40 people ten days just to get us back online. It then took those same 40 people a month of hard work collecting spreadsheets, reports, and all manner of other information to get us back close to where we thought we had been.

The customer complaints were a nightmare during the entire evolution, and there was no income at all for the first 6 days it took to get back online.

In today's world, I do restores of my two big production databases every night. The Developers love it because they use it as a "final" test before we push code to test, and the Test folks love it because the nightly restore ensures they have the latest data for final testing.
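As a rough sketch of what such a nightly refresh can look like (the database, logical file, and path names here are hypothetical):

```sql
-- Refresh a test copy from last night's production full backup.
USE master;

-- Kick out any connections still holding the test copy open.
ALTER DATABASE [Sales_Test] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE [Sales_Test]
FROM DISK = N'\\BackupShare\Sales\Sales_Full.bak'
WITH REPLACE,                  -- overwrite the previous night's copy
     MOVE N'Sales'     TO N'D:\TestData\Sales_Test.mdf',
     MOVE N'Sales_log' TO N'L:\TestLog\Sales_Test.ldf',
     CHECKSUM,
     RECOVERY,
     STATS = 10;

ALTER DATABASE [Sales_Test] SET MULTI_USER;
```

A restore that succeeds every night is itself the verification: if the backup is bad, you find out the next morning, not during a disaster.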

I do that on top of backing up with checksums and having a special email sent to me if any backup fails. I'm also working on reincorporating an old job I wrote a decade ago that gives me a count of the backups that appear on disk for every database in the entire enterprise (it's just a couple hundred DBs across 8 servers) and returns those counts in a morning report.
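For what it's worth, here's a rough sketch of those two pieces in T-SQL. The database name and path are placeholders, and the count below comes from msdb's backup history rather than from the files on disk; it's just one way to feed that kind of morning report.

```sql
-- Back up with CHECKSUM so page checksums are validated as the backup is written.
BACKUP DATABASE [Sales]
TO DISK = N'\\BackupShare\Sales\Sales_Full.bak'
WITH CHECKSUM, COMPRESSION, STATS = 10;

-- One way to feed a morning report: count the backups msdb recorded for every
-- database on this server in the last 24 hours.
SELECT  d.name                    AS database_name,
        COUNT(b.backup_set_id)    AS backups_last_24h,
        MAX(b.backup_finish_date) AS last_backup_finished
FROM    sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
        ON  b.database_name = d.name
        AND b.backup_finish_date >= DATEADD(HOUR, -24, GETDATE())
WHERE   d.name <> N'tempdb'
GROUP BY d.name
ORDER BY backups_last_24h, d.name;
```

Any database showing a zero in that count is exactly the situation described below: backups silently not happening.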

Remember: don't develop a backup plan. Develop a restore plan, which also includes making sure that the backups did occur and that they are viable. You cannot check too much or too often.

--Jeff Moden


Why did it happen?

To make a really long story short, a bit of code to update ~1000 customers, which had worked quite nicely in dev and test, did something odd when it was deployed to production. It turned out to be a fault in SQL Server that they came out with an emergency hotfix for.

The issue was that if parallelism occurred because of a large amount of data with "just the right indexes in place", SQL Server would ignore a WHERE clause that did an ISNULL comparison.

The code had a WHERE ISNULL() condition on the 1.2 million row customer table and should have updated only about 1000 customers. Instead, it updated ALL the customers. Because the DBA didn't know how many rows were supposed to be updated, when it came back with a rowcount of 1.2 million, he committed the transaction.
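Purely as an illustration of the shape of that statement (the table and column names below are invented, not the actual code):

```sql
-- Should have touched ~1,000 of the 1.2 million rows...
UPDATE dbo.Customer
   SET AccountStatus = 'Migrated'
 WHERE ISNULL(LegacyRegionCode, '') = 'NW';

-- ...but under the faulty parallel plan the ISNULL() filter was effectively
-- ignored and every row in dbo.Customer was updated.
```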

It wasn't until the next morning, after all the nightly snapshots had been made, that anyone noticed the problem. They shut off public access to the database and tried to do a restore, only to find there were no backups available for the last month.

As a bit of a sidebar, that incident proved I was correct about deployment code: it MUST have the expected row counts hardcoded and displayed at run time for every action taken in the code, so that the DBA can make the final decision to either commit or roll back.
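A minimal sketch of that idea in T-SQL, again using made-up table and column names, might look like this:

```sql
DECLARE @ExpectedRows int = 1000,   -- hardcoded expectation for this deployment
        @ActualRows   int;

BEGIN TRANSACTION;

UPDATE dbo.Customer
   SET AccountStatus = 'Migrated'
 WHERE ISNULL(LegacyRegionCode, '') = 'NW';

SET @ActualRows = @@ROWCOUNT;

PRINT CONCAT('Expected ~', @ExpectedRows, ' rows; actually updated ', @ActualRows, ' rows.');

-- The transaction is deliberately left open. The DBA compares the two numbers
-- above and then issues COMMIT TRANSACTION or ROLLBACK TRANSACTION by hand.
```

With a pattern like that, a rowcount of 1.2 million against an expected ~1000 is an obvious signal to roll back before any damage is committed.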

And, I have to tell you, there isn't an automatic deployment system in the world that would have caught this issue, because we checked, and a SELECT written to determine the row count made the same error.

That was way back on SQL Server 2000. I've seen lots more go wrong in SQL Server since, so this type of problem is never going to go away. You MUST write your deployment code to be bombproof and verifiable by some human means. If you don't, you end up with the mouse guarding the cheese and, MAN, can that mouse eat some cheese!

As for why the backups didn't happen, I wasn't made privy to that detail. It's almost not an important detail. What's important is that no one knew the backups weren't actually happening for a whole bloody month. There wouldn't have been an issue if someone had checked once per day or read a system report on the status of the backups. It was an absolutely stupid blunder on the part of the DBA and the Infrastructure Team.