My Philosophy on Alerting – Google Docs

A former SRE's take on optimal alerting processes.

Summary

When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:

 

  • Pages should be urgent, important, actionable, and real.
  • They should represent either ongoing or imminent problems with your service.
  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
  • You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
  • Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
  • Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
  • The further up your serving stack you go, the more distinct problems you catch in a single rule.  But don’t go so far you can’t sufficiently distinguish what’s going on.
  • If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.
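One way to read the list above is as a routing decision for each alerting rule. The sketch below is only an illustration of that reading, not something taken from the original document: the `Alert` fields, the `Route` values, and the `route` function are all hypothetical names, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake a human up now
    TICKET = "ticket"        # needs a timely response, but not imminently critical
    DASHBOARD = "dashboard"  # cause-based detail, shown for context only
    DROP = "drop"            # noisy or not actionable: remove the rule


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical labels an auditor would assign by hand.
    name: str
    symptom_based: bool  # fires on what users see, not on an internal cause
    actionable: bool     # a human can do something about it
    real: bool           # reflects an actual problem, not a flaky check
    urgent: bool         # ongoing or imminent problem with the service
    important: bool      # worth interrupting someone for


def route(alert: Alert) -> Route:
    """Classify one alerting rule, following the summary bullets (assumed order)."""
    # Err on the side of removing noisy alerts.
    if not (alert.real and alert.actionable):
        return Route.DROP
    # Avoid alerting directly on causes; keep them on dashboards.
    if not alert.symptom_based:
        return Route.DASHBOARD
    # Pages should be urgent, important, actionable, and real.
    if alert.urgent and alert.important:
        return Route.PAGE
    # Everything else goes to the system for sub-critical work.
    return Route.TICKET


if __name__ == "__main__":
    rules = [
        Alert("high_5xx_ratio", True, True, True, True, True),
        Alert("stale_report_data", True, True, True, False, True),
        Alert("cpu_high_on_one_host", False, True, True, True, True),
        Alert("flaky_probe", True, True, False, True, True),
    ]
    for rule in rules:
        print(f"{rule.name}: {route(rule).value}")
```

The useful part is less the code than the forcing function: every rule has to land in exactly one bucket, and anything that cannot justify a page gets a quieter home instead of staying in the oncall rotation.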

Source: My Philosophy on Alerting – Google Docs