Thanks Jeff. I’m overdue to build a Monitoring Tool.
Some time ago we decided that Notification-on-Error was a poor way of knowing that something had failed: the email address is no longer relevant, the SMTP server is down as well, the email system stupidly decided the message was spam, and so on.
Notification-on-Success is differently stupid - getting a hundred “It went OK” emails would never help me notice that ONE was missing.
So our plan (which I think I wrote about in another thread) is to have a single central server which will issue a remote query to each of the other servers. That query can run whatever is appropriate on that server - “How recent are all the database backups?” might well be generic, but “When did the Mainframe Import last run successfully?” might be bespoke to that server.
Each of these tests would have some metadata associated with it (e.g. a GUID for an ID, and the date of the last successful run) which could be stored, and reported on, centrally. The primary requirement, though, is just to know that all the tests were within bounds / successful, and also that all such tests had completed, and when.
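To make that a bit more concrete, here is a rough sketch in Python of what the central server might hold per test and how it could decide "within bounds". Everything here is made up for illustration: the `CheckResult` fields, the `max_age` threshold and the `find_problems` function are my assumptions, not anything we have actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
from uuid import UUID


@dataclass
class CheckResult:
    check_id: UUID                     # GUID identifying the test
    server: str                        # server the remote query ran on
    name: str                          # e.g. "database backups"
    last_success: Optional[datetime]   # None if the test has never succeeded
    max_age: timedelta                 # hypothetical "within bounds" threshold
    collected_at: Optional[datetime]   # when the server answered; None if it did not


def find_problems(results, now):
    """Return (server, test name, reason) for every test that is not OK."""
    problems = []
    for r in results:
        if r.collected_at is None:
            # The test did not complete at all (server down, query stuck, ...)
            problems.append((r.server, r.name, "no answer from server"))
        elif r.last_success is None:
            problems.append((r.server, r.name, "never succeeded"))
        elif now - r.last_success > r.max_age:
            problems.append(
                (r.server, r.name, f"last success {r.last_success:%Y-%m-%d %H:%M}")
            )
    return problems
```

The point of tracking `collected_at` separately from `last_success` is that a server which never answers shows up as a problem in its own right, rather than silently dropping off the list.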
This would catch the “stuck backup”, as your “report of all jobs” does, but it would be a single report for all servers. There would be no need to know that I should be expecting N emails; there would only be one report, listing all failures, if any.
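The single report itself could be as simple as the following Python sketch. The `build_report` function and the layout of the output are hypothetical, just to show the idea of one summary that always arrives and lists failures only when there are some.

```python
def build_report(servers, problems):
    """Build one plain-text report for all servers.

    servers:  names of every server the central server polled
    problems: list of (server, test name, reason) tuples; empty if all is well
    """
    lines = [f"Checked {len(servers)} server(s); {len(problems)} problem(s)."]
    for server, check, reason in sorted(problems):
        lines.append(f"FAIL  {server}  {check}: {reason}")
    if not problems:
        lines.append("All checks within bounds.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_report(
        ["SRV1", "SRV2"],
        [("SRV2", "Mainframe Import", "no answer from server")],
    ))
```

Because the first line is always there, an absent report is itself the signal that the central server has a problem, which is a lot easier to spot than one missing email out of a hundred.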