I’ve been working in information security for about two decades — spanning attack and defense, across the public and private sectors — and the most consistent truth I’ve found is that people overwhelmingly misunderstand how information security works. Even worse, the common misconceptions are such an endemic problem that they’ve fueled a $75 billion industry, comprised largely of snake oil solutions that range from ineffective to outright harmful. That’s left us in a place where the vast majority of the tech sector is throwing their money away on security that just doesn’t work, while ignoring the basic practices and processes that actually do produce secure systems … but it doesn’t have to be this way.
A good experiment, if not commercially successful. .trust was designed by Alex Stamos and other smart people to operate like many of the recent web security innovations: operate securely, or block users. Though you can no longer purchase a .trust domain and the monitoring and testing services that enforce these policies, the policybook itself is a masterclass in web security, comprehensive, organized, and well written.
Operating distributed systems at scale requires an unusual set of skills—problem solving, programming, system design, networking, and OS internals—which are difficult to find in one person. At Google, we’ve found some ways to hire Site Reliability Engineers, blending both software and systems skills to help keep a high standard for new SREs across our many teams and sites, including standardizing the format of our interviews and the unusual practice of making hiring decisions by committee. Adopting similar practices can help your SRE or DevOps team grow by consistently hiring excellent coworkers.
In order to run the company’s numerous services as efficiently and reliably as possible, Google’s Site Reliability Engineering (SRE) organization leverages the expertise of two main disciplines: Software Engineering and Systems Engineering. The roles of Software Engineer (SWE) and Systems Engineer (SE) lie at the two poles of the SRE continuum of skills and interests. While Site Reliability Engineers tend to be assigned to one of these two buckets, there is much overlap between the two job roles, and the knowledge exchange between the two job roles is rather fluid.
Google is constantly changing our software to implement new, useful features for our users. Unfortunately, making changes is inherently risky. Google services are quite complex, and any new feature might accidentally cause problems for users. In fact, most outages of Google services are the result of deploying a change. As a consequence, there is an inherent tension between the desire to innovate quickly and to keep the site reliable. Google manages this tension by using a metrics-based approach called an unreliability budget, which provides an objective metric to guide decisions involving tradeoffs between innovation and reliability.