Streamlining Those Database Alerts

Part 5 of the Do It Yourself Database Monitoring Series

Okay…I set out to spin up an article about chaos in database environments for this series.  In a short amount of time I had 9 pages of notes.  Too much.  Maybe for another day.  I decided to narrow the scope to alerting as it relates to monitoring.

There are plenty of alerting configurations in the database world.  Some introduce overhead, redundancy, and disorder.  I’ve listed below some of the issues that I believe to be out there.

  • Redundant alerts – These are alerts that are superseded by another alert.  Say, for example, there is an alert for the services running, one for a ping test, and another for connectivity to the database engine.  Why have all of these?  It doesn’t make a lot of sense.  Only one alert is needed here, and that is the connectivity check.  Why?  Because if any of the conditions behind the other alerts occur, the connectivity alert will be triggered too.  Redundant alerts can also be those that are monitored by more than one means.  TIP: Think big picture with the alerts.
  • Ignored alerts – These are alerts that no one responds to.  There is no point in sending an alert if it is going to be ignored.  TIP: Change how you do things — either fix the underlying condition or stop sending the alert.
  • Alert spamming – Having the same alert sent out repeatedly until the problem is resolved.  I understand that there are people out there who like this.  To them it is the squeaky wheel getting the grease; the problem that pesters the most gets fixed first.  My take is that only one alert is needed.  My issue with spamming is that it can overflow an inbox, may hide other issues, and is generally a pain to deal with.  TIP: Trim what you can, reduce the potential for issues, or build your own monitoring tool.
  • Alerts configured on each individual server – With something that is absolutely mission critical this can be a good route to take.  It may be justified, too, if there is only one server where something needs to be monitored.  This is one of those areas where a balancing act comes into play.  My preference is to have the alerting in one centralized location.  If there are multitudes of instances, I would prefer not to manage and configure alerts on each one individually.  The reason I bring up the mission-critical alerts is that sending the alerts from a centralized location introduces a single point of failure.
  • Alerts on success – Good programming practice may be to check for the positive condition, but that isn’t the case in alerting — unless the positive condition is rephrased as something negative.  Do you really want an alert for every successful backup?  Over time these will all be ignored.  TIP: Monitor for the problem that arises when the event does not occur.
  • SQL Agent job failures – Sometimes a job can succeed, but a step within the job can fail.  You may want to watch job steps instead.  TIP: Take advantage of the retry attempts on SQL Agent jobs before sending an alert.  In some situations it is okay to try the job again; transaction log backups come to mind.
  • Required response times – There are cases where an alert needs to be responded to as soon as possible, and others where it can wait.  As we move forward, this will cause us to break the alerts into different categories.  TIP: Formally discuss what response is needed.
  • Alerts not cleared – This is about having a way to know/communicate when the alerting condition is resolved.
  • Multiple responders – People looking into the same alert at the same time.  Sure, it’s good to have teamwork and people working together.  It’s not good to have multiple people looking into the same issue without knowing about each other.  I like to call this ‘stepping on each other’s toes’.  TIP: Talk to each other or get the alert into an incident tracking system.  A simple email saying you are looking into the situation is a good place to start.
  • False positives – An alert sent based on erroneous information.  This could be because the metrics for the alert are not pulled correctly, or because of a glitch of some type.
  • Recipients – Alerts being sent to people and groups who have no reason to be aware of them.
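The fix for alert spamming and uncleared alerts above can be sketched in a few lines.  This is a minimal illustration, not any particular product's behavior: send one alert when a condition first fails, stay quiet while it keeps failing, and send one "cleared" notification when it recovers.  The alert key name is a made-up placeholder.

```python
class AlertDeduplicator:
    """Track which alert conditions are currently open so each one
    generates a single alert and a single 'cleared' notification."""

    def __init__(self):
        self.open_alerts = set()  # alert keys currently firing

    def evaluate(self, key, failing):
        """Return the action to take: 'alert', 'clear', or None."""
        if failing and key not in self.open_alerts:
            self.open_alerts.add(key)
            return "alert"          # first failure -> one alert, no repeats
        if not failing and key in self.open_alerts:
            self.open_alerts.remove(key)
            return "clear"          # condition resolved -> notify once
        return None                 # no state change -> no email

dedup = AlertDeduplicator()
results = [dedup.evaluate("db01:connectivity", f)
           for f in (True, True, True, False, False)]
print(results)  # ['alert', None, None, 'clear', None]
```

Five consecutive checks produce exactly two messages instead of three — the squeaky wheel squeaks once, and the inbox survives.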
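The "alerts on success" point — monitor for the problem that arises when the event does not occur — can also be shown concretely.  Rather than emailing on every successful backup, check whether a backup has completed within the expected window.  The one-hour window and the function name here are assumptions for illustration; how you fetch the last backup time depends on your platform.

```python
from datetime import datetime, timedelta

def backup_overdue(last_backup_at, now, max_age=timedelta(hours=1)):
    """True when the most recent backup is older than the allowed window.
    Alert on this condition instead of on every success."""
    return (now - last_backup_at) > max_age

now = datetime(2024, 1, 1, 12, 0)
print(backup_overdue(datetime(2024, 1, 1, 11, 30), now))  # False - 30 min old
print(backup_overdue(datetime(2024, 1, 1, 10, 30), now))  # True - 90 min old
```

One check like this replaces a stream of success emails that would be ignored anyway.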

I like to have everything as simple as possible with a good rate of effectiveness.  Perfection is tough to obtain.  I also want the signal-to-noise ratio to be 1:1.  The signal-to-noise ratio here is the ratio of required responses to alerts sent.  One required response to 5 alerts is a 1:5 signal-to-noise ratio.  This ratio really is about keeping tasks and issues in order.  Too much noise and you lose sight of what is important.
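For the record-keepers, the ratio above reduces to lowest terms like any fraction — a tiny sketch, with made-up numbers:

```python
from math import gcd

def signal_to_noise(responses, alerts):
    """Reduce responses:alerts to lowest terms, e.g. 2 responses
    to 10 alerts -> '1:5'. The goal is '1:1'."""
    g = gcd(responses, alerts)
    return f"{responses // g}:{alerts // g}"

print(signal_to_noise(1, 5))    # 1:5 - four alerts were noise
print(signal_to_noise(10, 10))  # 1:1 - every alert needed a response
```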

Okay, it’s time, let’s get started with some high-level requirements for the alerting piece of the monitoring tool:

  • Centralize the alerting into one location rather than on each individual server.  Failed jobs are included here as well.
  • Maintain the history of alerts.
  • Classify the alerts into different categories that will be handled differently:
    • Alert – An event that needs immediate attention. Creates an incident.
    • Warning – Something to be aware of. A filegroup at 80% capacity may be a decent example.  Incident not created, but needs to be seen.
    • Notification – Information purposes only.  This could be a notice that an alert has cleared.
    • Report only – Not sure what to call these items, which could be consolidated into one email, web page, report, etc.  Items could include a user being created, a file growing, or maybe even the number of times a job failed within a period of time.
  • Send the alert to an incident tracking or help desk application. This allows someone to become the owner of the issue, and allows for documenting the situation: future reference, proof of activity, repeatable notes, etc.  It should also be possible to close the incident automatically if the situation resolves itself.
  • Have the ability to add or remove servers and/or specific databases from sending the alerts. 
  • Configurable thresholds for alerts and warnings:
    • Default threshold values
    • Number of times the failure occurs consecutively before an alert is sent.
    • Thresholds to trigger the alert and warning (if applicable). Examples include lag time, percentage of space, etc.
  • The capacity to create different groups to send alerts out to. Have the alerting groups configurable by alert, alert type, etc.  
  • All configurable settings in one location.  Not on individual servers or databases.
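Several of these requirements — categories, default thresholds, consecutive-failure counts, and configurable recipient groups — fit together naturally as one per-check configuration.  Here is a rough sketch of how that might look; every name, threshold, and group value below is a hypothetical placeholder, not a setting from any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class CheckConfig:
    name: str
    warn_threshold: float            # e.g. filegroup at 80% capacity -> warning
    alert_threshold: float           # past this, an incident is warranted
    failures_before_alert: int = 1   # consecutive failures before alerting
    recipients: str = "dba-team"     # alerting group, configurable per check
    _streak: int = field(default=0, repr=False)

    def evaluate(self, value):
        """Classify one measured value: 'alert', 'warning', or None.
        Alerts fire only after the configured consecutive-failure streak."""
        if value >= self.alert_threshold:
            self._streak += 1
            if self._streak >= self.failures_before_alert:
                return "alert"       # incident-worthy
            return None              # failing, but not long enough yet
        self._streak = 0             # recovered -> reset the streak
        if value >= self.warn_threshold:
            return "warning"         # seen, but no incident created
        return None

check = CheckConfig("filegroup_pct_used",
                    warn_threshold=80, alert_threshold=95,
                    failures_before_alert=2)
print([check.evaluate(v) for v in (70, 85, 96, 96)])
# [None, 'warning', None, 'alert']
```

Keeping all of these `CheckConfig` records in one central table or file — rather than scattered across servers — is exactly the "all configurable settings in one location" requirement.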

These requirements may look like a lot.  They're not, and they will help get our environment in order.  By being able to classify the alerts and set thresholds, we can make some magic happen.  For the record, having the alerting in one location does introduce a single point of failure.  When the alerting goes down, we aren't monitoring across the enterprise as we would like.