Scaling Backend Authentication at Facebook

Secure authentication and authorization within Facebook’s infrastructure play important
roles in protecting people using Facebook’s services. Enforcing security while maintaining a
flexible and performant infrastructure can be challenging at Facebook’s scale, especially in the
presence of varying layers of trust among our servers. Providing authentication and encryption
on a per-connection basis is certainly necessary, but also insufficient for securing more complex
flows involving multiple services or intermediaries at lower levels of trust.
To handle these more complicated scenarios, we have developed two token-based mechanisms
for authentication. The first type is based on certificates and allows for flexible verification due
to its public-key nature. The second type, known as “crypto auth tokens”, is symmetric-key
based, and hence more restrictive, but also much more scalable to a high volume of requests.
Crypto auth tokens rely on pseudorandom functions to generate independently-distributed keys
for distinct identities.
Finally, we provide (mock) examples which illustrate how both of our token primitives can be
used to authenticate real-world flows within our infrastructure, and how a token-based approach
to authentication can be used to handle security more broadly in other infrastructures which
have strict performance requirements and where relying on TLS alone is not enough.


A history of IPv6 challenges in Facebook data centers

Todd Hollmann brought the Facebook IPv6 story into the data center and the different stages of IPv6 deployment. Facebook’s current data center fabric is all IPv6, but a few years ago, the team faced an interesting and unexpected dilemma: We had run out of space! Assigning large prefixes to each rack made all the tooling and summarization easier, but it was wasteful. Facebook tried to model IPv6 like IPv4 where possible, but it turns out it wasn’t really possible because of the lack of proper support throughout the protocols. Further, Facebook decided to allocate a /64 network per rack, which seems a little excessive but winds up being efficient in terms of routing table lookups in ASICs and ECMP implementation for IPv6. Finally, Todd covered the challenges on the management plane with IPv6; traceroute, ping, SNMP, SSH, and other tools all initially had significant bugs with their IPv6 implementation.

Automatic regression triaging at Facebook

Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early — without slowing down developer speed — is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through the process for identifying and fixing regressions using AutoTriage. The team starts by logging performance-tracking metrics for products that they care about. Once a regression has been observed, the team uses Stack Trace Finder to map the regression to a candidate list of offending functions. The team then uses a tool called Pushed Commit Search to locate all diffs that introduced changes to the offending functions. A Diff Ranker algorithm quickly prioritizes diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis

Data Monitoring at 250 Gbit/s with Facebook – YouTube

Inside Facebook, the team has always provided monitoring as a service. This allows them to keep application monitoring both approachable and powerful to serve use cases of different complexity. They enable realtime analysis, regressions and anomaly detection, as well as root-causing site-level issues to specific applications and nodes causing them within minutes. Being a radar and powering automations for Facebook Infrastructure is a big scalability challenge. Learn how Facebook scaled its real-time monitoring system 20x and now peaking at 250 Gbit/s ingestion rate. They’ll dive into the monitoring system’s architecture evolution and some of the problems they faced along the way. They’ll also discuss current challenges, including anomaly detection at scale, driving data exploration, and intelligent spam fighting.