Generic Job Monitoring - across servers

Kristen · January 13, 2017, 10:26am

I'm looking for a means of monitoring "things that go wrong".

We can set up SQL Agent Jobs to email-on-failure, but that sometimes fails (target person is on holiday, email doesn't arrive, whatever), it also requires us to test that all Jobs for Failure [not always easy/possible] and that they deliver a suitable message to the right person, etc. Failure is rare, we need notification to work reliably!

We have Jobs that run without error, so no on-failure email generated, but some sort of User Action Requirement is logged. Perhaps the users never bother to view the logs / react ...

So I'm thinking of each Server having a table of "Jobs and Outcomes", and then each of our Jobs / Tasks can update that table with Start time, End time, and any error number / message.

We can then have a central report that queries all the remote / client servers for any results in the "Jobs and Outcomes" table that need attention. Perhaps a job duration is surprisingly long (or short ...), perhaps a job has an error, perhaps a job hasn't run at all ... or didn't complete.

One of the issues we face is with clients who have no maintenance contract with us, so we have no budget to monitor or fix anything. Definitely don't want a "Job failed" email from them, as that requires connecting to their server and "having a look" which may only be a few minutes but is probably not billable. We've had situations where a daily task failed to run for a couple of months before any of the users noticed (despite that we have user alerts / reports ...). however, I'd be happy that a failed job [for a non-maintenance client] appeared on a single report, of all clients / (failed) jobs etc., so that I could at least be aware of it - without any time-requirement on my part to make a diagnosis. (Of course I am in control of what gets logged into our "Jobs and Outcomes" table, so I can make any error message "obvious" to my colleagues to allow us to make an instant diagnosis.

Here's an example:

New IT support company appointed by Client. They set up a full-SQL--backup (to some remote location we can't reach). First time we needed to restore to investigate something (Client thought Fraud had been committed) we couldn't restore to specific day as our [differential] backup chain was broken (by the new Full [non-"Copy"] Backup. Clearly not a disaster, as in a real disaster we would probably have been able to work out that a Full Backup was missing and find where it was, but it would have been at a time that the Client was urgently needing the system back upright and the delay would have been costly.

So my solution would be to build an "Outcome" query that checks that all databases have recent backups, and that all backup files on disk have contiguous LSNs etc. so we would get early warning of any Muppet that created a backup outside of The System.

Does something like this already exist?
Do you tackle this a different way?
Any other suggestions?

Thanks.