(notes from Next ’17)
The thin line between alerting and over-alerting
History of monitoring at Google
Stackdriver runs on top of Monarch.
- Service Level Objective (SLO) – see the code sketch below:
  - 99.9% uptime
  - <250 ms latency at the 99th percentile
- Service Level Indicator (SLI):
  - 99th percentile latency
- Service Level Agreement (SLA):
  - Uptime < 99.9% is eligible for a partial refund
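A minimal sketch of checking SLIs against those SLOs (not from the talk; the latency samples, request counts, and nearest-rank percentile helper are all made up for illustration):

```python
def percentile(samples_ms, p):
    # Nearest-rank percentile over raw latency samples.
    ordered = sorted(samples_ms)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

latency_ms = [120, 95, 180, 240, 310, 130, 110, 90, 200, 150]  # hypothetical samples
ok_requests, total_requests = 99_950, 100_000                  # hypothetical counts

sli_p99 = percentile(latency_ms, 99)       # SLI: 99th percentile latency
sli_uptime = ok_requests / total_requests  # SLI: success ratio ("uptime")

# SLO: <250 ms at the 99th percentile, 99.9% uptime
slo_met = sli_p99 < 250 and sli_uptime >= 0.999
print(f"p99={sli_p99} ms  uptime={sli_uptime:.3%}  SLO met: {slo_met}")
```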
What is “good monitoring”?
- Trends – detect and analyze trends
- Compare over time – business metrics: how do they compare to last season?
- Dashboards – at a glance introspection
- Debugging – can help with debugging
The four golden signals (sketched in code after this list):
- Errors or error ratio //User view
- Latency //User view
- Traffic //Service view
- Saturation //Utilization level of resources, service view.
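A rough sketch of tracking the four signals in-process (my own illustration, not from the talk; the class shape and capacity figure are assumptions):

```python
class GoldenSignals:
    """Collects the four golden signals for one service over a time window."""

    def __init__(self, capacity_rps):
        self.capacity_rps = capacity_rps  # assumed maximum throughput
        self.requests = 0                 # traffic
        self.errors = 0                   # errors
        self.latencies_ms = []            # latency samples

    def record(self, latency_ms, is_error=False):
        self.requests += 1
        self.errors += int(is_error)
        self.latencies_ms.append(latency_ms)

    def snapshot(self, window_s):
        rps = self.requests / window_s
        ordered = sorted(self.latencies_ms)
        return {
            "traffic_rps": rps,                                  # traffic
            "error_ratio": self.errors / max(self.requests, 1),  # errors
            "latency_p99_ms": ordered[int(0.99 * (len(ordered) - 1))] if ordered else None,  # latency
            "saturation": rps / self.capacity_rps,               # saturation
        }

sig = GoldenSignals(capacity_rps=500)
sig.record(42)
sig.record(2300, is_error=True)
print(sig.snapshot(window_s=60))
```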
Black Box vs White Box
Black Box (probing)
- Externally observed
- Fewer shared failures
- More reliable
- Uptime check
- Write data, read it back (see the sketch below)
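A sketch of what those two black-box probes could look like (the host and endpoint paths are hypothetical, not anything named in the talk):

```python
import urllib.request
import uuid

BASE = "https://service.example.com"   # hypothetical service under test

def probe():
    # Uptime check: can an external client reach the frontend at all?
    status = urllib.request.urlopen(BASE + "/healthz", timeout=5).status
    # Write data, read it back: exercises the whole stack end to end.
    token = uuid.uuid4().hex
    urllib.request.urlopen(BASE + "/probe/write?value=" + token, timeout=5)
    echoed = urllib.request.urlopen(BASE + "/probe/read", timeout=5).read().decode()
    return status == 200 and echoed == token
```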
White Box (metrics)
- Exported from the system itself
- Great detail
- Crucial for debugging
- Using custom or log-based metrics
- Allocated vs resident memory (see the sketch below)
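As a sketch of that last point, only the process itself can export this kind of detail (Unix-only example; the workload is made up):

```python
import resource
import tracemalloc

tracemalloc.start()
blobs = [bytearray(1024) for _ in range(10_000)]   # allocate ~10 MiB so there is something to see

allocated_bytes, _ = tracemalloc.get_traced_memory()               # what the program asked for
resident_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak resident set size (KiB on Linux)
print(f"allocated ~{allocated_bytes / 2**20:.1f} MiB, resident ~{resident_kib / 1024:.1f} MiB")
```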
Sampling rate matters!
Don’t just look at percentiles either:
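A made-up illustration of both points: a low sampling rate can miss a slow tail entirely, and a single percentile can look fine while users suffer:

```python
latencies_ms = [100] * 95 + [2000] * 5   # hypothetical: 5% of requests are very slow

def pct(samples, q):
    return sorted(samples)[int(q / 100 * (len(samples) - 1))]

print("p50 =", pct(latencies_ms, 50), "p99 =", pct(latencies_ms, 99))  # p50 looks fine, p99 does not
sampled = latencies_ms[::20]              # a 1-in-20 sample of the same traffic
print("sampled p99 =", pct(sampled, 99))  # the slow tail disappears at this sampling rate
```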
Alert on Symptoms
Symptoms: externally visible behavior of a system.
Example: the user gets an HTTP response quickly.
SLO: Threshold of acceptable behavior
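A sketch of alerting on a symptom: the condition is stated in terms of the user-visible error ratio against the SLO, not an internal cause like CPU load (the thresholds here are assumptions):

```python
SLO_ERROR_RATIO = 0.001   # from an assumed 99.9% availability target

def should_alert(errors_in_window, requests_in_window):
    """Fire only when the user-visible error ratio breaches the SLO threshold."""
    if requests_in_window == 0:
        return False
    return errors_in_window / requests_in_window > SLO_ERROR_RATIO

print(should_alert(errors_in_window=30, requests_in_window=20_000))  # True: 0.15% > 0.1%
```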
JUAN Criteria – put every alerting metric through these checks to make sure it’s valid:

| Criterion | Meaning | Fails if the answer is |
|---|---|---|
| Judgement | Human input is needed | “Always turn this knob” |
| Urgent | Needs timely resolution | “Can wait for next week” |
| Actionable | Something can be done | “Went away by itself” |
| Necessary | Problem is ongoing or imminent | “Users won’t be affected” |
Humans have a finite stress budget. To deal with that, you can:
- Keep alerts down
- Provide help
Pattern recognition + fundamental laziness: if alerts fire too frequently, they will be ignored.
Too tight: engineers burn out.
Too loose: users leave.
Three nines of uptime means you have 0.1% of the time as a budget to spend on adding features (or whatever else) while still hitting your target.
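Worked out (my arithmetic, not a figure from the talk):

```python
target = 0.999
budget = 1 - target                       # 0.1% error budget
per_month = budget * 30 * 24 * 60         # minutes in a 30-day month
per_year = budget * 365 * 24 * 60 / 60    # hours in a year
print(f"~{per_month:.1f} minutes of downtime per 30-day month, ~{per_year:.1f} hours per year")
```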