Flexible Network Bandwidth and Latency Provisioning in the Datacenter

Predictably sharing the network is critical to achieving
high utilization in the datacenter. Past work has focussed
on providing bandwidth to endpoints, but often
we want to allocate resources among multi-node services.
In this paper, we present Parley, which provides
service-centric minimum bandwidth guarantees, which
can be composed hierarchically. Parley also supports
service-centric weighted sharing of bandwidth in excess
of these guarantees. Further, we show how to configure
these policies so services can get low latencies even at
high network load. We evaluate Parley on a multi-tiered
oversubscribed network connecting 90 machines, each
with a 10Gb/s network interface, and demonstrate that
Parley is able to meet its goals.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43871.pdf
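
As a rough illustration of the policy the abstract describes (a guaranteed minimum per service, plus weighted sharing of whatever capacity is left over), the sketch below allocates bandwidth on a single shared link. It is not Parley's actual mechanism, which the abstract does not detail: the function name `allocate`, the flat (non-hierarchical) service list, and the demand figures are all assumptions made for the example.

```python
def allocate(capacity, services):
    """Split `capacity` among services (hypothetical helper, not Parley's API).

    services maps name -> (guarantee, weight, demand), all in the same unit
    (for example Gb/s). Returns name -> allocated rate.
    """
    # Phase 1: each service receives its guaranteed minimum, capped by demand.
    alloc = {name: min(g, d) for name, (g, w, d) in services.items()}
    used = sum(alloc.values())
    if used > capacity:
        raise ValueError("guarantees exceed link capacity")
    remaining = capacity - used

    # Phase 2: share the excess in proportion to weight (water-filling).
    # Services whose demand is met drop out and free their share for others.
    active = sorted(n for n, (g, w, d) in services.items()
                    if d > alloc[n] and w > 0)
    while remaining > 1e-9 and active:
        total_weight = sum(services[n][1] for n in active)
        spent = 0.0
        satisfied = []
        for n in active:
            g, w, d = services[n]
            share = remaining * w / total_weight
            need = d - alloc[n]
            give = min(share, need)
            alloc[n] += give
            spent += give
            if need <= share:
                satisfied.append(n)
        remaining -= spent
        if not satisfied:
            break  # every active service took its full share; link is full
        active = [n for n in active if n not in satisfied]
    return alloc


# Two services on a 10 Gb/s link, both wanting more than they are guaranteed.
result = allocate(10, {"A": (2, 1, 10), "B": (4, 3, 10)})
print(result)
```

Here the 4 Gb/s left after the guarantees is split 1:3, so the service with the larger weight receives more of the excess while neither drops below its minimum.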

Linux 4.x performance: Using BPF superpowers

Brendan Gregg from Netflix kicked off our technical talks with an in-depth presentation on the power of using BPF to analyze performance on Linux systems. The extended Berkeley Packet Filter is a relatively new tool in the performance engineer’s toolbox that lets analysts run extremely efficient profiling code in a VM in the kernel. Brendan showed us how to write a BPF program, examples of some useful metrics, and a powerful way to visualize results using flame graphs. In particular, he demonstrated how to measure how long threads were blocked and how the threads were ultimately woken up. By following a chain of wakeup events across threads, Brendan showed how BPF and flame graphs could be used to trace blocked threads back to their root cause through user and kernel code, often all the way down to the metal.

Automatic regression triaging at Facebook

Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early, without slowing developers down, is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through the process for identifying and fixing regressions using AutoTriage. The team starts by logging performance-tracking metrics for the products they care about. Once a regression has been observed, the team uses Stack Trace Finder to map the regression to a candidate list of offending functions. The team then uses a tool called Pushed Commit Search to locate all diffs that introduced changes to the offending functions. A Diff Ranker algorithm quickly prioritizes diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis.
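
The last two steps of that pipeline (find the diffs that touched the offending functions, then rank them) can be sketched as follows. This is only an illustration of the idea: the talk summary does not say how Diff Ranker scores diffs, so the scoring rule here (sum of the regression attributed to each touched function), the function name `rank_diffs`, and the sample data are all hypothetical.

```python
def rank_diffs(function_deltas, diffs):
    """Order candidate diffs by a simple, assumed likelihood score.

    function_deltas: function name -> regression attributed to it
                     (for example extra CPU samples; positive means slower).
    diffs:           diff id -> set of function names the diff changed.
    Returns diff ids, most suspicious first.
    """
    # Stand-in for the "offending functions" list from stack trace analysis.
    offending = {fn for fn, delta in function_deltas.items() if delta > 0}

    scored = []
    for diff_id, touched in diffs.items():
        hits = touched & offending
        if hits:  # stand-in for the commit search: keep diffs touching offenders
            score = sum(function_deltas[fn] for fn in hits)
            scored.append((score, diff_id))

    # Highest score first; ties broken by diff id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [diff_id for _, diff_id in scored]


deltas = {"render": 5, "parse": 2, "log": -1}
candidates = {
    "D1": {"parse"},
    "D2": {"render", "log"},
    "D3": {"log"},
    "D4": {"render", "parse"},
}
print(rank_diffs(deltas, candidates))
```

A real ranker would likely weigh other signals as well, such as when a diff landed relative to the regression, but those are not described in the summary.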

Evolution of high-performance networking in Chromium

After wowing the audience with some surprise sleight-of-hand magic, Jim Roskind of Google gave us a taste of the power of gathering metrics at scale to guide performance engineering. Jim started his talk with an overview of client-side histograms. Histograms in Chromium are super-fast at runtime — a “slow” setup path allocates the histogram buckets and defines their dynamic range, but after setup everything is lock-free and lightning-quick. The framework has a simple developer API for bumping up counters, which lets engineers record metrics with as few as 2-3 lines of code. After an overview of their histogram framework, Jim showed off examples of successful investigations they’ve done into DNS resolution, TCP connection latency, UDP reachability, and the efficacy of FEC. These findings influenced the design of the QUIC network protocol, which is used heavily by Google.
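
The split between a slow setup path and a cheap recording path can be illustrated with a small sketch. This is not Chromium's histogram API: the class name, the power-of-two bucket layout, and the sample values are assumptions, and plain Python does not reproduce the lock-free behaviour described in the talk. It only shows why recording a sample is cheap once the bucket boundaries have been computed up front.

```python
import bisect


class Histogram:
    """Fixed-bucket histogram (illustrative only, not Chromium's class)."""

    def __init__(self, minimum, maximum):
        # Slow path, run once: compute exponentially spaced bucket edges
        # covering the requested dynamic range, then allocate the counters.
        bounds = []
        edge = minimum
        while edge <= maximum:
            bounds.append(edge)
            edge *= 2
        self.bounds = bounds
        # Bucket 0 is underflow (< minimum); the last bucket is open-ended.
        self.counts = [0] * (len(bounds) + 1)

    def add(self, sample):
        # Fast path: one binary search and one counter increment.
        self.counts[bisect.bisect_right(self.bounds, sample)] += 1


# Recording a metric takes a couple of lines at the call site.
dns_latency_ms = Histogram(1, 64)
for sample in [0, 1, 3, 3, 100]:
    dns_latency_ms.add(sample)
print(dns_latency_ms.counts)
```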

HTTP2 server push: Lower latencies around the world

With HTTP2 push, Facebook has built out a new client/server interaction model in which the company’s Edge/FBCDN servers can ‘push’ the images and Live streams required for a News Feed story or ongoing live stream. HTTP2 server push features are now available to the public. This talk will cover how Facebook leverages HTTP2 to achieve lower latencies.

Benchmarking the cloud to build applications that work

In this talk, Google will cover its pursuit of a fair and meaningful Cloud benchmarking framework, PerfKit Benchmarker, from the perspective of one of its performance engineers. The talk will cover the challenges and pitfalls the team faced in defining what matters, along with common customer challenges, and share how they were tackled. It will also cover sampling challenges, the processing and storage of 3K samples/second, and the challenge of mining and visualizing the data in a meaningful way.