Went to get last night's backup off a (not commonly visited) remote server for some DEV work ... most recent backup was 20-Sep-2017. Yeah, I did the whole "WTF" routine ...
Went into SQL Agent and looked at the History ... Bizarre! ... its still running ... It complete the backup correctly (on 20-Sep) but then the next step has stalled - looks like its an "In Use File" that it is (still!!) waiting on.
Blinking accident-waiting-to-happen and NOT (for us at least) something that we have ever anticipated might happen. We check for "Terminated with error" but not "Job did not start" - of course the "Next start date" for the job is probably useless (I haven't checked yet) as I am assuming it won't, yet, have been set - because the job will be set to not start if it is already running.
Something else to watch out for - or maybe this is not a problem that you could have/ If so "why is that" please?
Like you, I have the backup stored procedure that I wrote that sends alerts if a given backup fails. And, like you, there's no way it can detect when the job gets stuck. It's not a problem for me, though, because I have another job that generates and sends me an email every morning that produces a report for all jobs on the server.and I review it every morning.
Thanks Jeff. I'm overdue to build a Monitoring Tool.
Some time ago we decided that Notification-on-Error was a poor way of knowing that something had failed - Email address no longer relevant, SMTP server also down, Email system stupidly thought it was SPAM, and so on.
Notification-on-Success is differently stupid - getting a hundred "It went OK" emails would never equip me to remember that ONE was missing.
So our plan (which I think I wrote about in another thread) is to have a single central server which will issue a remote query to each other server. That query can run whatever is appropriate on that server - "How recent are all the database backups" might well be generic, but "When did the Mainframe Import last run successfully" might be bespoke to that server.
Each of these tests would have some Metadata associated with it (e.g. a GUID for an ID, and the Date of the last successful run) which could be stored, and reported on, centrally, but essentially as a primary requirement, we would just need to know that all the tests were within bounds / successful. Also that all such tests had completed, and when.
This would catch the "stuck backup", as your "report of all jobs" does, but it would be a single report, for all servers, and no need to know that I should be expecting N-emails, there would only be one report, listing all failures - if any
In house 4, at our hosting another 4 or so, and at client sites "quite a few"
All the Client Site ones are reachable from our office (i.e. our IP is "approved" on their firewall) so we can direct-connect our SQL to their SQL (linked server, whatever), so I figure I can run an SProc on each server and just record the results into a local table; add to that some expected results (so I can alert if we should have had "Test-5" from "Server-B" and haven't had a value for that), and then I think??!! I should be good-to-do.
... there's an (unexpected, of course!) Duplicate occurring in an SProc (causing terminal error) on a job which runs VERY frequently. My Inbox is filling up faster than I can delete them !!
I figure to add another Feature to my Meta Table to cause any failed notification to be capable of being suspended. I expect it would be dangerous to turn it off! in case I forgot to turn it back on again, but I think a cutoff-time for "Do not notify again until PLUS X-HOURS" would be worthwhile
Not quite sure how to stop a job notifying error though ... perhaps it needs a "On error goto step X" and that step can, itself, trigger Error (which will cause Notification) UNLESS Notification is temporarily disabled in my Meta Table.
Don't much like the idea of the additional Conditional Steps in the job though ... adds complexity
Possible alternative route is to NOT have Notification on that job, but another Job that checks the Logs and sends an Alert for any failed job (having first checked in the Meta Table if that Job currently has notifications suspended
Yeah, I've experienced a similar issue. We use a third-party utility to de-dupe backups and every now and then it just decides to stop running backups. Very annoying.
I added some functionality to Is It SQL (http://www.scalesql.com/isitsql/) to tell me any databases that don't have appropriate backups. I check that every few days and tell the backup team.