Predictably sharing the network is critical to achieving
high utilization in the datacenter. Past work has focussed
on providing bandwidth to endpoints, but often
we want to allocate resources among multi-node services.
In this paper, we present Parley, which provides
service-centric minimum bandwidth guarantees, which
can be composed hierarchically. Parley also supports
service-centric weighted sharing of bandwidth in excess
of these guarantees. Further, we show how to configure
these policies so services can get low latencies even at
high network load. We evaluate Parley on a multi-tiered
oversubscribed network connecting 90 machines, each
with a 10Gb/s network interface, and demonstrate that
Parley is able to meet its goals.
Brendan Gregg from Netflix kicked off our technical talks with an in-depth presentation on the power of using BPF to analyze performance on Linux systems. The extended Berkeley Packet Filter is a relatively new profiling tool in the performance engineer’s toolbox that lets analysts run extremely efficient profiling code in a VM in the kernel. Brendan showed us how to write a BPF program, examples of some useful metrics, and a powerful way to visualize results using Flamegraphs. In particular, he demonstrated how to measure how long threads were blocked and how the threads were ultimately woken up. By following a chain of wakeup events across threads, Brendan showed how BPF and Flamegraphs could be used to root-cause the source of blocked CPU threads through user and kernel code, often all the way down to the metal.
Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early — without slowing down developer speed — is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through the process for identifying and fixing regressions using AutoTriage. The team starts by logging performance-tracking metrics for products that they care about. Once a regression has been observed, the team uses Stack Trace Finder to map the regression to a candidate list of offending functions. The team then uses a tool called Pushed Commit Search to locate all diffs that introduced changes to the offending functions. A Diff Ranker algorithm quickly prioritizes diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis
After wowing the audience with some surprise sleight-of-hand magic, Jim Roskind of Google gave us a taste of the power of gathering metrics at scale to guide performance engineering. Jim started his talk with an overview of client-side histograms. Histograms in Chromium are super-fast at runtime — a “slow” setup path allocates the histogram buckets and defines their dynamic range, but after setup everything is lock-free and lightning-quick. The framework has a simple developer API for bumping up counters, which lets engineers record metrics with as few as 2-3 lines of code. After an overview of their histogram framework, Jim showed off examples of successful investigations they’ve done into DNS resolution, TCP connection latency, UDP reachability, and the efficacy of FEC. These findings influenced the design of the QUIC network protocol, which is used heavily by Google.
With HTTP2 push, Facebook has built out a new client/server interaction model, which now makes it possible for the company’s Edge/FBCDN servers to ‘push’ required images and Live streams from the server for a News Feed story or on-going live stream. HTTP2 Server push features are now available to the public. This talk will cover how Facebook leverages HTTP2 to achieve lower latencies.
In this talk, Google will cover its pursue of a fair and meaningful Cloud benchmarking framework, PerfKit Benchmarker, from one of its performance engineers’ perspective. The talk will cover the challenges and pitfalls the team faced in defining what matters, in addition to common customer challenges, and share how they were tackled. It will also cover sampling challenges, processing, and storage of 3K samples/second, and the challenge to mine and visualize the data in a meaningful way.
Steve Robertson (YouTube) makes the case for next generation video delivery today, including an explanation of the fundamentals of video compression, a recipe for deployment in your own service, and ways you can help shape VP9’s successor to better meet your needs.
QUIC is a multiplexed transport protocol running over UDP. It builds on the success of SPDY and HTTP/2 to make the web faster (e.g. improving page load latency, reducing video playback buffering), and to provide a platform for internet-scale experimentation. This talk will give an overview of QUIC, how it’s used today at Google, including deploying QUIC for YouTube video streaming, and what lies ahead.
In large scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job.
This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job’s reliability in the presence of correlated failures.
In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job’s reliability.