Alerting Best Practices

(notes from Next ’17)

The thin line between alerting and over-alerting

History of monitoring at Google

Monarch talk by John Banning

Stackdriver runs on top of Monarch.

Trumpet: Timely and Precise Triggers in Data Centers

As data centers grow larger and strive to provide tight performance and availability SLAs, their monitoring infrastructure must move from passive systems that provide aggregated inputs to human operators, to active systems that enable programmed control. In this paper, we propose Trumpet, an event monitoring system that leverages CPU resources and end-host programmability, to monitor every packet and report events at millisecond timescales. Trumpet users can express many network-wide events, and the system efficiently detects these events using triggers at end-hosts. Using careful design, Trumpet can evaluate triggers by inspecting every packet at full line rate even on future generations of NICs, scale to thousands of triggers per end-host while bounding packet processing delay to a few microseconds, and report events to a controller within 10 milliseconds, even in the presence of attacks. We demonstrate these properties using an implementation of Trumpet, and also show that it allows operators to describe new network events such as detecting correlated bursts and loss, identifying the root cause of transient congestion, and detecting short-term anomalies at the scale of a data center tenant.

Source: http://www.cs.yale.edu/homes/yu-minlan/writeup/sigcomm16.pdf
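To make the idea of a per-packet trigger at an end-host concrete, here is a minimal sketch in Python. It is not Trumpet's implementation or API: the `Packet` and `Trigger` classes, the byte-count predicate, and the fixed (tumbling) 10 ms window are all assumptions chosen for illustration. The trigger inspects every packet, accumulates bytes for packets matching a filter, and reports an event when a window that exceeded the threshold closes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Packet:
    # Hypothetical packet record; field names are illustrative only.
    ts_ms: float
    src: str
    dst: str
    size: int


@dataclass
class Trigger:
    # Hypothetical trigger: filter + byte threshold + fixed time window.
    name: str
    flow_filter: Callable[[Packet], bool]
    threshold_bytes: int
    window_ms: float
    _window_start: float = field(default=0.0, init=False)
    _bytes: int = field(default=0, init=False)

    def observe(self, pkt: Packet) -> Optional[dict]:
        """Feed one packet; return an event if the window that just
        closed exceeded the threshold, otherwise None."""
        event = None
        if pkt.ts_ms >= self._window_start + self.window_ms:
            # The current window has closed: evaluate the predicate.
            if self._bytes > self.threshold_bytes:
                event = {
                    "trigger": self.name,
                    "window_start_ms": self._window_start,
                    "bytes": self._bytes,
                }
            # Advance to the window that contains this packet.
            elapsed = pkt.ts_ms - self._window_start
            self._window_start += (elapsed // self.window_ms) * self.window_ms
            self._bytes = 0
        if self.flow_filter(pkt):
            self._bytes += pkt.size
        return event


# Usage: flag more than 2000 bytes sent to 10.0.0.2 within one 10 ms window.
burst = Trigger(
    name="burst-to-10.0.0.2",
    flow_filter=lambda p: p.dst == "10.0.0.2",
    threshold_bytes=2000,
    window_ms=10.0,
)
packets = [
    Packet(1.0, "10.0.0.1", "10.0.0.2", 1500),
    Packet(5.0, "10.0.0.1", "10.0.0.2", 1500),
    Packet(12.0, "10.0.0.1", "10.0.0.2", 100),
]
events = [burst.observe(p) for p in packets]
```

In this sketch an event is only emitted when a later packet arrives and closes the window; a real system reporting within a bounded delay would also need a timer-driven sweep, which is omitted here.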

M3A: Model, MetaModel, and Anomaly Detection in Web Searches

‘Alice’ is submitting one web search per five minutes, for three hours in a row: is it normal? How to detect abnormal search behaviors, among Alice and other users? Is there any distinct pattern in Alice’s (or other users’) search behavior? We studied what is probably the largest publicly available query log, containing more than 30 million queries from 0.6 million users. In this paper, we present a novel, user- and group-level framework, M3A: Model, MetaModel and Anomaly detection. For each user, we discover and explain a surprising, bi-modal pattern of the inter-arrival time (IAT) of landed queries (queries with user click-through). Specifically, the model Camel-Log is proposed to describe such an IAT distribution; we then notice the correlations among its parameters at the group level. Thus, we further propose the metamodel Meta-Click, to capture and explain the two-dimensional, heavy-tail distribution of the parameters. Combining Camel-Log and Meta-Click, the proposed M3A has the following strong points: (1) the accurate modeling of marginal IAT distribution, (2) quantitative interpretations, and (3) anomaly detection.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45670.pdf
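The quantity the abstract is built around, the inter-arrival time (IAT) between a user's queries, is easy to compute from raw timestamps. The sketch below is only that first step plus a log-scale histogram for eyeballing the shape of the distribution; it does not implement Camel-Log or Meta-Click. The function names and the choice of base-10 bins are assumptions made for illustration.

```python
import math
from collections import Counter


def inter_arrival_times(timestamps):
    """Return the gaps (in seconds) between consecutive timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def log_histogram(iats):
    """Count IATs per decade: bin k holds gaps with 10**k <= gap < 10**(k+1).

    Zero or negative gaps are skipped because their logarithm is undefined.
    """
    bins = Counter()
    for gap in iats:
        if gap > 0:
            bins[math.floor(math.log10(gap))] += 1
    return dict(bins)


# The 'Alice' example: one query every five minutes across three hours.
alice_timestamps = [i * 300 for i in range(37)]  # seconds since first query
alice_iats = inter_arrival_times(alice_timestamps)
alice_hist = log_histogram(alice_iats)
```

For Alice every gap is exactly 300 seconds, so all of her IATs fall into a single bin. A histogram with a single spike like this is the kind of shape that would stand out against a population whose IATs are spread across two modes.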

Data Monitoring at 250 Gbit/s with Facebook

Inside Facebook, the team has always provided monitoring as a service. This allows them to keep application monitoring both approachable and powerful enough to serve use cases of different complexity. They enable real-time analysis, regression and anomaly detection, and tracing of site-level issues to the specific applications and nodes causing them within minutes. Being a radar and powering automations for Facebook Infrastructure is a big scalability challenge. Learn how Facebook scaled its real-time monitoring system 20x, to an ingestion rate that now peaks at 250 Gbit/s. They’ll dive into the monitoring system’s architecture evolution and some of the problems they faced along the way. They’ll also discuss current challenges, including anomaly detection at scale, driving data exploration, and intelligent spam fighting.

Source: https://www.youtube.com/watch?v=Dnqfb4DXRT0