Software developers spend 35-50 percent of their time validating and debugging software. The cost of debugging, testing, and verification is estimated to account for 50-75 percent of the total budget of software development projects, amounting to more than $100 billion annually.11 While tools, languages, and environments have reduced the time spent on individual debugging tasks, they have not significantly reduced the total time spent debugging, nor the cost of doing so. Therefore, a hyperfocus on elimination of bugs during development is counterproductive; programmers should instead embrace debugging as an exercise in problem solving.
Brendan Gregg from Netflix kicked off our technical talks with an in-depth presentation on the power of using BPF to analyze performance on Linux systems. The extended Berkeley Packet Filter is a relatively new profiling tool in the performance engineer’s toolbox that lets analysts run extremely efficient profiling code in a VM in the kernel. Brendan showed us how to write a BPF program, examples of some useful metrics, and a powerful way to visualize results using Flamegraphs. In particular, he demonstrated how to measure how long threads were blocked and how the threads were ultimately woken up. By following a chain of wakeup events across threads, Brendan showed how BPF and Flamegraphs could be used to root-cause the source of blocked CPU threads through user and kernel code, often all the way down to the metal.
Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early — without slowing down developer speed — is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through the process for identifying and fixing regressions using AutoTriage. The team starts by logging performance-tracking metrics for products that they care about. Once a regression has been observed, the team uses Stack Trace Finder to map the regression to a candidate list of offending functions. The team then uses a tool called Pushed Commit Search to locate all diffs that introduced changes to the offending functions. A Diff Ranker algorithm quickly prioritizes diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis
After wowing the audience with some surprise sleight-of-hand magic, Jim Roskind of Google gave us a taste of the power of gathering metrics at scale to guide performance engineering. Jim started his talk with an overview of client-side histograms. Histograms in Chromium are super-fast at runtime — a “slow” setup path allocates the histogram buckets and defines their dynamic range, but after setup everything is lock-free and lightning-quick. The framework has a simple developer API for bumping up counters, which lets engineers record metrics with as few as 2-3 lines of code. After an overview of their histogram framework, Jim showed off examples of successful investigations they’ve done into DNS resolution, TCP connection latency, UDP reachability, and the efficacy of FEC. These findings influenced the design of the QUIC network protocol, which is used heavily by Google.
Most network monitoring relies in the individual network devices themselves telling you that they are healthy or unhealthy via syslog messages, SNMP data, etc. In a Facebook scale network we just can’t trust the network devices to accurately report their health in all the possible failure cases that may exist. In addition to the standard network monitoring tools, we also actively probe our network with test traffic to ensure it’s behaving exactly as we expect. We can now find the network devices that don’t even know they are dropping packets even when they exist several layers deep inside the network.
In this talk, Google will cover its pursue of a fair and meaningful Cloud benchmarking framework, PerfKit Benchmarker, from one of its performance engineers’ perspective. The talk will cover the challenges and pitfalls the team faced in defining what matters, in addition to common customer challenges, and share how they were tackled. It will also cover sampling challenges, processing, and storage of 3K samples/second, and the challenge to mine and visualize the data in a meaningful way.
Updating production software is a process that may require dozens, if not hundreds, of steps. These include creating and testing new code, building new binaries and packages, associating the packages with a versioned release, updating the jobs in production datacenters, possibly modifying database schemata, and testing and verifying the results. There are boxes to check and approvals to seek, and the more automated the process, the easier it becomes. When releases can be made faster, it is possible to release more often, and, organizationally, one becomes less afraid to “release early, release often” [6, 7]. And that’s what we describe in this article—making rollouts as easy and as automated as possible. When a “green” condition is detected, we can more quickly perform a new rollout. Humans are still needed somewhere in the loop, but we strive to reduce the purely mechanical toil they need to perform.