Todd Hollmann brought the Facebook IPv6 story into the data center and walked through the different stages of IPv6 deployment. Facebook’s current data center fabric is all IPv6, but a few years ago, the team faced an interesting and unexpected dilemma: they had run out of 10.0.0.0/8 space! Assigning large prefixes to each rack made all the tooling and summarization easier, but it was wasteful. Facebook tried to model IPv6 like IPv4 where possible, but that turned out not to be practical because of the lack of proper support throughout the protocols. Further, Facebook decided to allocate a /64 network per rack, which seems a little excessive but winds up being efficient in terms of routing table lookups in ASICs and ECMP implementation for IPv6. Finally, Todd covered the challenges on the management plane with IPv6: traceroute, ping, SNMP, SSH, and other tools all initially had significant bugs in their IPv6 implementations.
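To make the /64-per-rack scheme concrete, here is a minimal sketch using Python's standard ipaddress module. The site prefix 2001:db8::/48 is taken from the IPv6 documentation range, and the helper names rack_prefix and rack_capacity are illustrative assumptions; none of this reflects Facebook's actual addressing plan or tooling.

```python
import ipaddress

# Hypothetical site prefix from the IPv6 documentation range (2001:db8::/32).
# This is an assumption for illustration, not a real Facebook allocation.
SITE_PREFIX = ipaddress.IPv6Network("2001:db8::/48")


def rack_capacity(site: ipaddress.IPv6Network) -> int:
    """Number of /64 rack networks that fit inside the site prefix."""
    if site.prefixlen > 64:
        raise ValueError("site prefix must be /64 or shorter")
    return 2 ** (64 - site.prefixlen)


def rack_prefix(rack_index: int,
                site: ipaddress.IPv6Network = SITE_PREFIX) -> ipaddress.IPv6Network:
    """Return the /64 assigned to the rack with the given index."""
    if not 0 <= rack_index < rack_capacity(site):
        raise ValueError(f"rack index {rack_index} does not fit in {site}")
    # The rack index occupies the bits between the site prefix and bit 64,
    # so shifting it left by 64 places it just above the interface identifier.
    base = int(site.network_address)
    return ipaddress.IPv6Network((base + (rack_index << 64), 64))


if __name__ == "__main__":
    print(rack_capacity(SITE_PREFIX))  # 65536 racks fit in a /48
    print(rack_prefix(0))              # 2001:db8::/64
    print(rack_prefix(1))              # 2001:db8:0:1::/64
```

Because every rack boundary falls on bit 64, a router only has to match on the upper half of the address to pick a rack, which is one way to read the talk's point about lookup efficiency.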
Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early, without slowing down developers, is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through the process for identifying and fixing regressions using AutoTriage. The team starts by logging performance-tracking metrics for the products they care about. Once a regression has been observed, the team uses Stack Trace Finder to map the regression to a candidate list of offending functions. The team then uses a tool called Pushed Commit Search to locate all diffs that introduced changes to the offending functions. A Diff Ranker algorithm quickly prioritizes diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis.
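The chain of steps described above can be sketched as a small pipeline. The talk summary does not say how Stack Trace Finder, Pushed Commit Search, or Diff Ranker work internally, so everything below is an assumption made for illustration: the Diff record, the function names, the use of sampled stack counts as input, and the scoring rule (sum of sample-share increases for the functions a diff touches) are all hypothetical stand-ins.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Diff:
    """Hypothetical record of a pushed change and the functions it modified."""
    diff_id: str
    touched_functions: frozenset


def find_offending_functions(before: Counter, after: Counter,
                             min_increase: float = 0.01) -> dict:
    """Stand-in for Stack Trace Finder: functions whose share of samples grew."""
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    increases = {}
    for fn, count in after.items():
        delta = count / total_after - before[fn] / total_before
        if delta >= min_increase:
            increases[fn] = delta
    return increases


def search_pushed_diffs(diffs, offending: dict) -> list:
    """Stand-in for Pushed Commit Search: diffs touching any offending function."""
    return [d for d in diffs if not d.touched_functions.isdisjoint(offending)]


def rank_diffs(candidates, offending: dict) -> list:
    """Stand-in for Diff Ranker: larger summed increase means more suspicious."""
    def score(d: Diff) -> float:
        return sum(offending[fn] for fn in d.touched_functions if fn in offending)
    return sorted(candidates, key=lambda d: (-score(d), d.diff_id))


def auto_triage(before: Counter, after: Counter, diffs) -> list:
    """Chain the three steps into one ranked list of suspect diffs."""
    offending = find_offending_functions(before, after)
    return rank_diffs(search_pushed_diffs(diffs, offending), offending)


if __name__ == "__main__":
    # Made-up stack sample counts per function, before and after a push.
    before = Counter({"a": 60, "b": 20, "c": 20})
    after = Counter({"a": 60, "b": 40, "c": 50})
    diffs = [
        Diff("D1", frozenset({"a"})),
        Diff("D2", frozenset({"b"})),
        Diff("D3", frozenset({"c"})),
        Diff("D4", frozenset({"b", "c"})),
    ]
    print([d.diff_id for d in auto_triage(before, after, diffs)])
    # ['D4', 'D3', 'D2']
```

A real ranker would presumably weigh more signals than this, such as when a diff landed relative to the regression, but the sketch shows why chaining the steps removes most of the manual search.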
Inside Facebook, the team has always provided monitoring as a service. This allows them to keep application monitoring both approachable and powerful enough to serve use cases of varying complexity. The service enables real-time analysis, regression and anomaly detection, and root-causing of site-level issues, within minutes, to the specific applications and nodes responsible. Acting as a radar and powering automation for Facebook infrastructure is a big scalability challenge. Learn how Facebook scaled its real-time monitoring system 20x, to an ingestion rate that now peaks at 250 Gbit/s. They’ll dive into the evolution of the monitoring system’s architecture and some of the problems they faced along the way. They’ll also discuss current challenges, including anomaly detection at scale, driving data exploration, and intelligent spam fighting.