Ubiq: A Scalable and Fault-tolerant Log Processing Infrastructure

Abstract. Most of today’s Internet applications generate vast amounts
of data (typically, in the form of event logs) that needs to be processed
and analyzed for detailed reporting, enhancing user experience and increasing
monetization. In this paper, we describe the architecture of
Ubiq, a geographically distributed framework for processing continuously
growing log files in real time with high scalability, high availability and
low latency. The Ubiq framework fully tolerates infrastructure degradation
and data center-level outages without any manual intervention. It
also guarantees exactly-once semantics for application pipelines to process
logs as a collection of multiple events. Ubiq has been in production
for Google’s advertising system for many years and has served as a critical
log processing framework for several dozen pipelines. Our production
deployment demonstrates linear scalability with machine resources, extremely
high availability even with underlying infrastructure failures, and
an end-to-end latency of under a minute.

Source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45805.pdf

Advertisements

Application Layer Transport Security (LOAS)

Production systems at Google consist of a constellation of microservices1 that collectively issue O(1010) Remote Procedure Calls (RPCs) per second. When a Google engineer schedules a production workload2, any RPCs issued or received by that workload are protected with ALTS by default. This automatic, zero-configuration protection is provided by Google’s Application Layer Transport Security (ALTS). In addition to the automatic protections conferred on RPC’s, ALTS also facilitates easy service replication, load balancing, and rescheduling across production machines. This paper describes ALTS and explores its deployment over Google’s production infrastructure.

Source: https://cloud.google.com/security/encryption-in-transit/application-layer-transport-security/resources/alts-whitepaper.pdf

Google Cloud Platform Blog: Google Compute Engine uses Live Migration technology to service infrastructure without application downtime

What’s remarkable about April 7th, 2014 isn’t what happened that day. It’s what didn’t.

That was the day the Heartbleed bug was revealed, and people around the globe scrambled to patch their systems against this zero-day issue, which came with already-proven exploits. In other public cloud platforms, customers were impacted by rolling restarts due to a requirement to reboot VMs. At Google, we quickly rolled out the fix to all our servers, including those that host Google Compute Engine. And none of you, our customers, noticed. Here’s why.

We introduced transparent maintenance for Google Compute Engine in December 2013, and since then we’ve kept customer VMs up and running as we rolled out software updates, fixed hardware problems, and recovered from some unexpected issues that have arisen. Through a combination of datacenter topology innovations and live migration technology, we now move our customers running VMs out of the way of planned hardware and software maintenance events, so we can keep the infrastructure protected and reliable—without your VMs, applications or workloads noticing that anything happened.

Source: https://cloudplatform.googleblog.com/2015/03/Google-Compute-Engine-uses-Live-Migration-technology-to-service-infrastructure-without-application-downtime.html

Securing Clouds

Notes on “Lessons Learned from Securing Google and Google Cloud” talk by Neils Provos

Summary

  • Defense in Depth at scale by default
    • Protect identities by default
    • Protect data across full lifecycle by default
    • Protect resources by default
  • Trust through transparency
  • Automate best practices and prevent common mistakes at scale
  • Share innovation to raise the bar, support and invest in the security community.
  • Address common cases programmatically
  • Empower customers to fulfill their security responsibilities
  • Trust and security can be the accelerant

Read More »

Titan: a custom TPM and more

I listened to a podcast and cut out the chit-chat, so you don’t have to:

Titan is a tiny security co-processing chip used for encryption, authentication of hardware, authentication of services.

Purpose

Every piece of hardware in google’s infrastructure can be individually identified and cryptographically verified, and any service using it mutually authenticates to that hardware. This includes servers, networking cards, switches: everything. The Titan chip is one of the ways to accomplish that.

The chip certifies that hardware is in a trusted good state. If this verification fails, the hardware will not boot, and will be replaced.

Every time a new bios is pushed, Titan checks that the code is authentic Google code before allowing it to be installed.  It then checks each time that code is booted that it is authentic, before allowing boot to continue.

‘similar in theory to the u2f security keys, everything should have identity, hardware and software. Everything’s identity is checked all the time.’

Suggestions that it plays important role in hardware level data encryption, key management systems, etc.

Hardware

Each chip is fused with a unique identifier. Done sequentially, so can verify it’s part of inventory sequence.

Three main functions: RNG, crypto engine, and monotonic counter. First two are self-explanatory. Monotonic counter to protect against replay attacks, and make logs tamper evident.

Sits between ROM and RAM, to provide signature valididation of the first 8KB of BIOS on installation and boot up.

Production

Produced entirely within google. Design and process to ensure provenance. Have used other vendor’s security coprocessors in the past, but want to ensure they understand/know the whole truth.

Google folks unaware of any other cloud that uses TPMs, etc to verify every piece of hardware and software running on it.

An Inside Look at Google BigQuery

This white paper introduces Google BigQuery, a fully-managed and cloud based interactive query service for massive datasets. BigQuery is the external implementation of one of the company’s core technologies whose code name is Dremel. This paper discusses the uniqueness of the technology as a cloudenabled massively parallel query engine, the differences between BigQuery and Dremel, and how BigQuery compares with other technologies such as MapReduce/Hadoop and existing data warehouse solutions

Note: This is from 2012, so doesn’t include a lot of recent innovations like upsert and standard sql support.

Source: https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Lessons learned from B4, Google’s SDN WAN

Google’s B4 wide area network was first revealed several years ago. The outside observer might have thought, “Google’s B4 is finished. I wonder what they’re going to do next.” Turns out, once any network is in production @scale, there’s a continued need to make it better. Subhasree Mandal covered the reality of how Google iterated multiple times on different parts of B4 to improve its performance, availability, and scalability. Several of the challenges and solutions that Subhasree detailed were definitely at the intersection of networking and distributed systems. B4 was covered in a SIGCOMM 2013 paper from Google.