In-Datacenter Performance Analysis of a Tensor Processing Unit​ TM

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC–called a Tensor Processing Unit (TPU)–deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS), and a large (28MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X – 30X faster than its contemporary GPU or CPU with TOPS/Watt about 30X – 80X higher. Moreover, using the GPUs GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Source: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view

Flexible Network Bandwidth and Latency Provisioning in the Datacenter

Abstract
Predictably sharing the network is critical to achieving
high utilization in the datacenter. Past work has focussed
on providing bandwidth to endpoints, but often
we want to allocate resources among multi-node services.
In this paper, we present Parley, which provides
service-centric minimum bandwidth guarantees, which
can be composed hierarchically. Parley also supports
service-centric weighted sharing of bandwidth in excess
of these guarantees. Further, we show how to configure
these policies so services can get low latencies even at
high network load. We evaluate Parley on a multi-tiered
oversubscribed network connecting 90 machines, each
with a 10Gb/s network interface, and demonstrate that
Parley is able to meet its goals.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43871.pdf

Trumpet: Timely and Precise Triggers in Data Centers

As data centers grow larger and strive to provide tight performance
and availability SLAs, their monitoring infrastructure
must move from passive systems that provide aggregated
inputs to human operators, to active systems that enable programmed
control. In this paper, we propose Trumpet, an
event monitoring system that leverages CPU resources and
end-host programmability, to monitor every packet and report
events at millisecond timescales. Trumpet users can express
many network-wide events, and the system efficiently detects
these events using triggers at end-hosts. Using careful design,
Trumpet can evaluate triggers by inspecting every packet at
full line rate even on future generations of NICs, scale to
thousands of triggers per end-host while bounding packet
processing delay to a few microseconds, and report events
to a controller within 10 milliseconds, even in the presence
of attacks. We demonstrate these properties using an implementation
of Trumpet, and also show that it allows operators
to describe new network events such as detecting correlated
bursts and loss, identifying the root cause of transient congestion,
and detecting short-term anomalies at the scale of a data
center tenant.

Source: http://www.cs.yale.edu/homes/yu-minlan/writeup/sigcomm16.pdf

Onix: A Distributed Control Platform for Large-scale Production Networks

Computer networks lack a general control paradigm,
as traditional networks do not provide any networkwide
management abstractions. As a result, each new
function (such as routing) must provide its own state
distribution, element discovery, and failure recovery
mechanisms. We believe this lack of a common control
platform has significantly hindered the development of
flexible, reliable and feature-rich network control planes.
To address this, we present Onix, a platform on top of
which a network control plane can be implemented as a
distributed system. Control planes written within Onix
operate on a global view of the network, and use basic
state distribution primitives provided by the platform.
Thus Onix provides a general API for control plane
implementations, while allowing them to make their own
trade-offs among consistency, durability, and scalability.

Source: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Koponen.pdf

Google Infrastructure Security Design Overview

This document gives an overview of how security is designed into Google’s technical infrastructure. This global scale infrastructure is designed to provide security through the entire information processing lifecycle at Google. This infrastructure provides secure deployment of services, secure storage of data with end user privacy safeguards, secure communications between services, secure and private communication with customers over the internet, and safe operation by administrators.
Google uses this infrastructure to build its internet services, including both consumer services such as Search, Gmail, and Photos, and enterprise services such as G Suite and Google Cloud Platform.
We will describe the security of this infrastructure in progressive layers starting from the physical security of our data centers, continuing on to how the hardware and software that underlie the infrastructure are secured, and finally, describing the technical constraints and processes in place to support operational security.

Source: https://cloud.google.com/security/security-design/resources/google_infrastructure_whitepaper_fa.pdf

NYC Tech Talk Series: How Google Backs Up the Internet – YouTube

Systems like GMail and Picasa keep massive amounts of data in the cloud, all of which has to be constantly backed up to prepare for the inevitable. Typical backup and recovery techniques don’t scale, so Google has devised new methods for securing unprecedented volumes of data against every type of failure.

There are many unique challenges, both obvious and subtle, in delivering storage systems at this scale; we’ll discuss these and their solutions as well as some alternatives that didn’t make the grade.

About the speaker: Raymond Blum leads a team of Site Reliability Engineers charged with keeping Google’s and its users’ data safe and durable. Prior to coming to Google he was the IT director for a hedge fund after spending a few lifetimes developing systems at HBO and on Wall Street. In his meager spare time he indulges his interests in robotics and home automation and reads too much science fiction.

Source: https://www.youtube.com/watch?v=eNliOm9NtCM

Failure Trends in a Large Disk Drive Population

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

Source: https://www.usenix.org/legacy/event/fast07/tech/full_papers/pinheiro/pinheiro_html/