Alerting Best Practices

(notes from Next ’17)

The thin line between alerting and over-alerting

History of monitoring at Google

Monarch talk by John Banning

Stackdriver runs on top of Monarch.

  • Service Level Objective (SLO): the target
    • 99.9% uptime
    • <250 ms latency at the 99th percentile
  • Service Level Indicator (SLI): the measurement
    • Uptime
    • 99th percentile latency
  • Service Level Agreement (SLA): what happens when the target is missed
    • Uptime < 99.9% is eligible for a partial refund
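
To make the SLO / SLI / SLA split concrete, here is a minimal sketch that checks measured indicators against the example targets above. The class and function names are made up for illustration and are not part of Stackdriver or Monarch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Example targets from the notes; field names are illustrative."""
    min_uptime: float          # e.g. 0.999 for 99.9%
    max_p99_latency_ms: float  # e.g. 250.0


def meets_slo(uptime: float, p99_latency_ms: float, slo: Slo) -> bool:
    """Compare the measured SLIs (uptime, p99 latency) against the SLO."""
    return uptime >= slo.min_uptime and p99_latency_ms < slo.max_p99_latency_ms


def refund_eligible(uptime: float, sla_uptime: float = 0.999) -> bool:
    """SLA clause from the notes: uptime below 99.9% is eligible for a partial refund."""
    return uptime < sla_uptime


slo = Slo(min_uptime=0.999, max_p99_latency_ms=250.0)
print(meets_slo(uptime=0.9995, p99_latency_ms=180.0, slo=slo))
print(refund_eligible(uptime=0.9985))
```

The indicators are just numbers; the objective is the threshold you compare them with, and the agreement is the business consequence attached to a threshold.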

What is “good monitoring”?

Why monitor?

  • Trends – detect and analyze trends
  • Compare over time – business metrics: how do they compare to last season?
  • Dashboards – at-a-glance introspection
  • Alerting
  • Debugging – metrics can help with debugging

The four golden signals:

  • Errors or error ratio (user view)
  • Latency (user view)
  • Traffic (service view)
  • Saturation: level of resource utilization (service view)
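
As a rough sketch of how the four signals could be derived from one window of request records: the `Request` type, the nearest-rank p99, and the use of CPU as the saturated resource are all assumptions made for this example, not something from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool  # False if the request ended in an error


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_capacity: float) -> Dict[str, float]:
    """Summarize one time window into the four golden signals."""
    n = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile, using integer math for the ceiling.
    rank = (99 * n + 99) // 100
    p99 = latencies[rank - 1] if n else 0.0
    return {
        "error_ratio": errors / n if n else 0.0,  # user view
        "p99_latency_ms": p99,                    # user view
        "traffic_rps": n / window_s,              # service view
        "saturation": cpu_used / cpu_capacity,    # service view (assumed: CPU)
    }


# 100 requests with latencies 1..100 ms, two of them failed, over 10 seconds.
window = [Request(latency_ms=float(i), ok=(i > 2)) for i in range(1, 101)]
print(golden_signals(window, window_s=10.0, cpu_used=3.0, cpu_capacity=4.0))
```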

Black Box vs White Box

Black Box (probing)

  • Externally observed
  • Independent
    • Fewer shared failures
    • More reliable
  • Examples
    • Uptime check
    • Write data, read it back
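
A minimal sketch of the "write data, read it back" probe. The `put` and `get` callables stand in for whatever client the real service exposes (hypothetical here), and a plain dict plays the role of the service in the usage lines.

```python
import uuid
from typing import Callable


def write_read_probe(put: Callable[[str, str], None],
                     get: Callable[[str], str]) -> bool:
    """Write a unique value through the public interface and read it back.

    Returns True only if the round trip succeeds; any exception counts as a
    failed probe, because from the outside it looks the same as an outage.
    """
    key = f"probe-{uuid.uuid4()}"
    value = uuid.uuid4().hex
    try:
        put(key, value)
        return get(key) == value
    except Exception:
        return False


# Usage with an in-memory dict standing in for the real service.
store = {}
print(write_read_probe(store.__setitem__, store.__getitem__))
```

The probe knows nothing about the internals, which is the point: it shares fewer failure modes with the system it checks.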

White Box

  • Exported from system itself
  • Great detail
    • Crucial for debugging
  • Examples
    • Custom or logs-based metrics
    • Allocated vs resident memory

Sampling rate matters!
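
One way to see why, sketched with made-up numbers: a five-second CPU spike shows up when sampling every second and disappears entirely when sampling once a minute.

```python
from typing import List


def sample(series: List[int], every: int) -> List[int]:
    """Keep one point out of every `every` points, starting with the first."""
    return series[::every]


# Five minutes of per-second CPU readings (percent); values are invented.
per_second = [20] * 300
per_second[125:130] = [100] * 5  # a 5-second spike

print(max(sample(per_second, 1)))   # per-second sampling sees the spike
print(max(sample(per_second, 60)))  # per-minute sampling never lands on it
```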

Don’t just look at percentiles either.
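
A small illustration with invented latencies: the two services below report the same 99th percentile, yet a coarse histogram shows that most requests to the second one are slow. The nearest-rank percentile and the 100 ms bucket width are choices made for this sketch.

```python
from collections import Counter
from typing import Dict, List


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def histogram(values: List[int], bucket_ms: int) -> Dict[int, int]:
    """Count values per bucket, keyed by the bucket's lower bound."""
    counts = Counter((v // bucket_ms) * bucket_ms for v in values)
    return dict(sorted(counts.items()))


service_a = [10] * 98 + [240, 1000]       # almost everything is fast
service_b = [10] * 40 + [240] * 59 + [1000]  # most requests are slow

print(percentile(service_a, 99), percentile(service_b, 99))
print(histogram(service_a, 100))
print(histogram(service_b, 100))
```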

Alert on Symptoms

Symptoms: externally visible behavior of a system.

User gets HTTP response quickly

  • Errors
  • Latency

SLO: Threshold of acceptable behavior

JUAN criteria – put every alerting metric through these checks to make sure it’s valid:

Criterion    Definition                    Antipattern
Judgement    Human input                   “Always turn this knob”
Urgent       Timely resolution             “Can wait for next week”
Actionable   Something can be done         “Went away by itself”
Necessary    Problem ongoing or imminent   “Users won’t be affected”
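
The table above can be turned into a small review checklist. This is only a sketch: the `AlertReview` type and its field names are invented here, and the answers still have to come from a human looking at the alert.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertReview:
    """One reviewer's answers for a single alert (illustrative field names)."""
    needs_judgement: bool  # does handling it require human input?
    urgent: bool           # does it need timely resolution?
    actionable: bool       # can something actually be done?
    necessary: bool        # is a problem ongoing or imminent?


def failed_criteria(review: AlertReview) -> List[str]:
    """Return the JUAN criteria the alert does not satisfy."""
    checks = {
        "judgement": review.needs_judgement,
        "urgent": review.urgent,
        "actionable": review.actionable,
        "necessary": review.necessary,
    }
    return [name for name, passed in checks.items() if not passed]


# An alert that "can wait for next week" fails the Urgent check.
review = AlertReview(needs_judgement=True, urgent=False,
                     actionable=True, necessary=True)
print(failed_criteria(review))
```

An alert that fails any of the four is a candidate for automation, a ticket, or deletion rather than a page.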

Human element:

Humans have a finite stress budget. To deal with that, you can:

  • Keep alerts down
  • Provide help

Pattern recognition plus fundamental laziness: if alerts fire too often, people learn to ignore them.

Choosing SLOs

Too tight: the team burns out.

Too loose: users leave.

Three nines of uptime leaves 0.1% of the time as a budget: you can spend it on shipping features or whatever else carries risk, as long as you still hit your target.
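
For a sense of scale, here is the arithmetic as a short sketch. The 30-day window is an assumption; use whatever period your SLO is defined over.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an uptime SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already spent (negative means SLO missed)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# 99.9% over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, downtime_minutes=10.0), 1))
```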
