Securing Clouds

Notes on the “Lessons Learned from Securing Google and Google Cloud” talk by Niels Provos

Summary

  • Defense in Depth at scale by default
    • Protect identities by default
    • Protect data across full lifecycle by default
    • Protect resources by default
  • Trust through transparency
  • Automate best practices and prevent common mistakes at scale
  • Share innovation to raise the bar; support and invest in the security community.
  • Address common cases programmatically
  • Empower customers to fulfill their security responsibilities
  • Trust and security can be the accelerant

Protecting resources behind an authenticating proxy

Today, we’re putting our core web services behind the protections provided by U2F and Google’s account-takeover and anomaly-detection systems. This provides not only phishing resistance through the authenticating proxy, but also authorization through IAM roles assigned to the user’s Google account. (A sketch of what the application behind the proxy sees follows the prerequisites below.)

Prerequisites:

  • Google account
  • A U2F YubiKey enrolled and enforced for the users/groups that will access the application.
  • An hour or so.
  • A global cloud that has been operating at billions of rps for decades. (Beyond the scope of this article.)
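
To make the division of labor concrete, here is a minimal Python sketch of what an application behind the proxy might do with the identity the proxy asserts. The header name, key file, and audience are illustrative assumptions, not Google’s actual interface; the sketch assumes the proxy forwards a signed JWT and uses the PyJWT library.

    # Minimal sketch of an app behind an authenticating proxy. Assumptions
    # (hypothetical, not Google's actual interface): the proxy forwards a
    # signed JWT in an "X-Proxy-JWT" header, signed with ES256, and we hold
    # the proxy's public key in PEM form.
    import jwt  # PyJWT

    PROXY_PUBLIC_KEY = open("proxy_pubkey.pem").read()
    EXPECTED_AUDIENCE = "https://myapp.example.com"

    def authenticated_user(request_headers):
        """Return the proxy-verified user identity, or None on failure."""
        token = request_headers.get("X-Proxy-JWT")
        if token is None:
            return None  # request did not come through the proxy
        try:
            claims = jwt.decode(
                token,
                PROXY_PUBLIC_KEY,
                algorithms=["ES256"],        # never let the token pick its algorithm
                audience=EXPECTED_AUDIENCE,  # reject tokens minted for other apps
            )
        except jwt.InvalidTokenError:
            return None
        return claims.get("email")  # identity asserted by the proxy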

Titan: a custom TPM and more

I listened to a podcast and cut out the chit-chat, so you don’t have to:

Titan is a tiny security co-processor chip used for encryption, hardware authentication, and service authentication.

Purpose

Every piece of hardware in Google’s infrastructure can be individually identified and cryptographically verified, and any service using it mutually authenticates to that hardware. This includes servers, networking cards, switches: everything. The Titan chip is one of the ways to accomplish that.

The chip certifies that hardware is in a trusted good state. If this verification fails, the hardware will not boot, and will be replaced.

Every time a new BIOS is pushed, Titan checks that the code is authentic Google code before allowing it to be installed. Each time that code boots, it checks again that the code is authentic before allowing boot to continue.

‘Similar in theory to the U2F security keys: everything should have an identity, hardware and software. Everything’s identity is checked all the time.’

There were suggestions that it plays an important role in hardware-level data encryption, key management systems, etc.

Hardware

Each chip is fused with a unique identifier. This is done sequentially, so Google can verify that a given chip is part of its inventory sequence.

Three main functions: an RNG, a crypto engine, and a monotonic counter. The first two are self-explanatory; the monotonic counter protects against replay attacks and makes logs tamper-evident.
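
To make the counter’s role concrete, here is a toy Python sketch (my illustration, not Titan’s design) of how a strictly increasing counter woven into a hash chain makes a log tamper-evident and replay-resistant:

    # Toy illustration (not Titan's actual design): each log entry binds a
    # strictly increasing counter into a hash chain. Replaying an old entry
    # reuses a stale counter value; altering any entry breaks the chain.
    import hashlib

    class TamperEvidentLog:
        def __init__(self):
            self.counter = 0          # stands in for the hardware monotonic counter
            self.prev_hash = b"\x00" * 32
            self.entries = []

        def append(self, message: bytes):
            self.counter += 1         # hardware guarantees this never goes backwards
            record = self.counter.to_bytes(8, "big") + self.prev_hash + message
            digest = hashlib.sha256(record).digest()
            self.entries.append((self.counter, message, digest))
            self.prev_hash = digest

        def verify(self) -> bool:
            prev_hash, prev_counter = b"\x00" * 32, 0
            for counter, message, digest in self.entries:
                if counter != prev_counter + 1:
                    return False      # gap or replay: counter did not advance by one
                record = counter.to_bytes(8, "big") + prev_hash + message
                if hashlib.sha256(record).digest() != digest:
                    return False      # entry was altered after the fact
                prev_hash, prev_counter = digest, counter
            return True

    log = TamperEvidentLog()
    log.append(b"boot: BIOS verified")
    log.append(b"boot: kernel loaded")
    assert log.verify()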

Sits between ROM and RAM, to provide signature validation of the first 8 KB of the BIOS on installation and at boot.
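
In software terms, that boot-time check amounts to something like the following sketch. The Ed25519 algorithm and the Python cryptography library are assumptions for illustration; Titan’s actual algorithms and formats aren’t public at this level of detail.

    # Sketch of the boot-time check in software terms. Assumptions
    # (illustrative only): the first 8 KB of the BIOS image is signed with
    # Ed25519 and the signing public key is baked into the verifier.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
    from cryptography.exceptions import InvalidSignature

    FIRST_STAGE_SIZE = 8 * 1024  # the 8 KB region validated on install and boot

    def bios_is_authentic(bios_image: bytes, signature: bytes, pubkey_bytes: bytes) -> bool:
        """Allow boot to continue only if the first 8 KB verifies."""
        first_stage = bios_image[:FIRST_STAGE_SIZE]
        public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
        try:
            public_key.verify(signature, first_stage)
        except InvalidSignature:
            return False  # refuse to boot; hardware gets flagged for replacement
        return True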

Production

Produced entirely within Google; the design and process ensure provenance. Google has used other vendors’ security coprocessors in the past, but wants to understand and know the whole truth about the part.

The Google folks are unaware of any other cloud that uses TPMs or similar to verify every piece of hardware and software running on it.

Lessons learned from B4, Google’s SDN WAN

Google’s B4 wide area network was first revealed several years ago. The outside observer might have thought, “Google’s B4 is finished. I wonder what they’re going to do next.” Turns out, once any network is in production @scale, there’s a continued need to make it better. Subhasree Mandal covered the reality of how Google iterated multiple times on different parts of B4 to improve its performance, availability, and scalability. Several of the challenges and solutions that Subhasree detailed were definitely at the intersection of networking and distributed systems. B4 was covered in a SIGCOMM 2013 paper from Google.

Networking between Earth and Mars

Last year, we learned about high-frequency financial trading from JPMorgan Chase and the nanoseconds that are important to that type of networking. This year, we went to the other extreme as we let Matt Damon (aka Luther Beegle) from the Jet Propulsion Laboratory take us off-planet by explaining the network operations involved in talking to the Mars rovers. When you have 24 minutes of round-trip time and your signal bounces through multiple satellite dishes and satellites in the Deep Space Network, proper planning, monitoring, and error handling are critical. The science teams have only short windows each day to send and receive data, using technology that was prepped a decade ago because of mission preparation times and long launch windows. (They also measure their throughput in late ’80s-style kilobits per second.) It’s inspiring to see what the science teams have accomplished a world away.
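
For a sense of scale, the round-trip time follows directly from the speed of light and the Earth–Mars distance, which varies from roughly 0.5 to 2.5 AU. A quick back-of-the-envelope in Python:

    # Back-of-the-envelope: light-time to Mars. The Earth-Mars distance
    # swings between roughly 0.5 and 2.5 AU, so the round-trip time swings
    # between about 8 and 42 minutes; ~24 minutes sits mid-range.
    C_KM_PER_S = 299_792.458
    AU_KM = 149_597_870.7

    for au in (0.5, 1.5, 2.5):
        rtt_min = 2 * au * AU_KM / C_KM_PER_S / 60
        print(f"{au:.1f} AU: round trip ~{rtt_min:.0f} minutes")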

IPv6@Comcast

John Brzozowski has been a long-time IPv6 advocate, and he gave an overview of the advanced state of IPv6 on Comcast’s network. Today, IPv6 serves the majority of Comcast’s business needs, and usage has grown to over 25 percent of its internet-facing traffic. One extremely interesting revelation from John was that IPv6 will effectively become the underlay for all services in Comcast’s network, including IPv4 itself: Comcast plans to implement IPv4 as a service. This is mainly to support legacy content and endpoints; going native with IPv6 has greatly simplified operations and sidestepped the complexity of running the two protocols as separate offerings.

Linux 4.x performance: Using BPF superpowers

Brendan Gregg from Netflix kicked off our technical talks with an in-depth presentation on the power of using BPF to analyze performance on Linux systems. The extended Berkeley Packet Filter is a relatively new profiling tool in the performance engineer’s toolbox that lets analysts run extremely efficient profiling code in a VM in the kernel. Brendan showed us how to write a BPF program, examples of some useful metrics, and a powerful way to visualize results using Flamegraphs. In particular, he demonstrated how to measure how long threads were blocked and how the threads were ultimately woken up. By following a chain of wakeup events across threads, Brendan showed how BPF and Flamegraphs could be used to root-cause the source of blocked CPU threads through user and kernel code, often all the way down to the metal.
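
For a taste of the approach, here is a minimal bcc sketch (mine, not Brendan’s actual tooling) that counts scheduler wakeups by waker PID via a kprobe on try_to_wake_up; wakeup flame graphs start from this same event and add stack traces. It assumes the bcc toolkit and kernel headers are installed, and it requires root.

    # Minimal bcc sketch: count scheduler wakeups by waker PID. Not
    # Brendan's tooling; wakeup flame graphs trace this same event but
    # also collect stacks. Requires bcc, kernel headers, and root.
    from bcc import BPF
    import time

    prog = r"""
    BPF_HASH(counts, u32, u64);

    int kprobe__try_to_wake_up(struct pt_regs *ctx) {
        u32 waker_pid = bpf_get_current_pid_tgid() >> 32;  // PID of the waker
        counts.increment(waker_pid);
        return 0;
    }
    """

    b = BPF(text=prog)
    time.sleep(5)  # let wakeups accumulate
    for pid, count in sorted(b["counts"].items(), key=lambda kv: kv[1].value):
        print(f"pid {pid.value}: {count.value} wakeups")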

Managing an open source program at scale

Open source software is fundamental to building a modern tech company. This panel discussion will feature companies widely recognized for running high-quality open source programs. Attendees will gain insight into the tooling, processes, and team structures these companies have built to manage open source programs that keep communities engaged at scale.
Featuring:
Jeff McAffer, Microsoft
Surupa Biswas, Facebook
Andrew Spyker, Netflix
Moderated by GitHub

HTTP2 server push: Lower latencies around the world

With HTTP/2 server push, Facebook has built a new client/server interaction model that makes it possible for the company’s Edge/FBCDN servers to push the images a News Feed story requires, or the segments of an ongoing live stream, before the client requests them. HTTP/2 server push is now available to the public. This talk will cover how Facebook leverages HTTP/2 to achieve lower latencies.
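
The mechanism itself is easy to demonstrate. Below is a sans-IO sketch using the Python h2 library, with two in-memory connections playing client and server; it illustrates the generic push handshake, not Facebook’s stack, and the paths are made up.

    # Sans-IO demonstration of HTTP/2 server push with the "h2" library.
    # An illustration of the mechanism, not Facebook's implementation.
    import h2.config
    import h2.connection
    import h2.events

    client = h2.connection.H2Connection(h2.config.H2Configuration(client_side=True))
    server = h2.connection.H2Connection(h2.config.H2Configuration(client_side=False))
    client.initiate_connection()
    server.initiate_connection()
    server.receive_data(client.data_to_send())  # exchange connection preambles
    client.receive_data(server.data_to_send())

    # Client requests a (hypothetical) News Feed story.
    request = [(":method", "GET"), (":path", "/feed/story/42"),
               (":scheme", "https"), (":authority", "example.com")]
    client.send_headers(1, request, end_stream=True)
    server.receive_data(client.data_to_send())

    # The server knows the story references an image, so it promises the
    # image on stream 2 before the client ever asks for it.
    push = [(":method", "GET"), (":path", "/images/photo.jpg"),
            (":scheme", "https"), (":authority", "example.com")]
    server.push_stream(stream_id=1, promised_stream_id=2, request_headers=push)
    server.send_headers(1, [(":status", "200")])
    server.send_data(1, b"{...story json...}", end_stream=True)

    for event in client.receive_data(server.data_to_send()):
        if isinstance(event, h2.events.PushedStreamReceived):
            print("server promised:", dict(event.headers)[b":path"])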

No shard left behind: APIs for massive parallel efficiency

Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.
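
For flavor, here is a minimal Beam Python pipeline (the canonical word count, not an example from the talk). The point of the I/O APIs described above is that the runner may split and re-split the ReadFromText source at runtime, so the same pipeline code scales without the author doing anything:

    # Minimal Apache Beam pipeline: the canonical word count. The runner
    # is free to split the ReadFromText source into bundles and rebalance
    # them at runtime -- the I/O APIs discussed above enable that.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read" >> beam.io.ReadFromText("input.txt")
         | "SplitWords" >> beam.FlatMap(lambda line: line.split())
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "CountPerWord" >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
         | "Write" >> beam.io.WriteToText("counts"))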