Trumpet: Timely and Precise Triggers in Data Centers

As data centers grow larger and strive to provide tight performance
and availability SLAs, their monitoring infrastructure
must move from passive systems that provide aggregated
inputs to human operators, to active systems that enable programmed
control. In this paper, we propose Trumpet, an
event monitoring system that leverages CPU resources and
end-host programmability to monitor every packet and report
events at millisecond timescales. Trumpet users can express
many network-wide events, and the system efficiently detects
these events using triggers at end-hosts. Using careful design,
Trumpet can evaluate triggers by inspecting every packet at
full line rate even on future generations of NICs, scale to
thousands of triggers per end-host while bounding packet
processing delay to a few microseconds, and report events
to a controller within 10 milliseconds, even in the presence
of attacks. We demonstrate these properties using an implementation
of Trumpet, and also show that it allows operators
to describe new network events such as detecting correlated
bursts and loss, identifying the root cause of transient congestion,
and detecting short-term anomalies at the scale of a data
center tenant.
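
To make the trigger abstraction above concrete, here is a minimal Python sketch of a per-packet trigger: a match predicate plus a windowed per-flow byte count compared against a threshold. The class, field, and function names are illustrative assumptions, not Trumpet's actual API.

    import time
    from collections import defaultdict

    # Hypothetical sketch of an end-host trigger (not Trumpet's real API):
    # a packet-match predicate plus a per-flow byte count over a short
    # time window; crossing the threshold produces an event report.
    class Trigger:
        def __init__(self, match, threshold, window_ms=10):
            self.match = match            # predicate over packet headers
            self.threshold = threshold    # bytes per window that signal an event
            self.window_ms = window_ms
            self.window_start = time.monotonic()
            self.bytes_per_flow = defaultdict(int)

        def on_packet(self, pkt):
            """Called for every packet; must stay cheap to keep line rate."""
            if not self.match(pkt):
                return
            now = time.monotonic()
            if (now - self.window_start) * 1000 >= self.window_ms:
                self.bytes_per_flow.clear()       # start a fresh window
                self.window_start = now
            flow = pkt["flow"]
            before = self.bytes_per_flow[flow]
            self.bytes_per_flow[flow] = before + pkt["len"]
            if before < self.threshold <= self.bytes_per_flow[flow]:
                report_to_controller(flow, self.bytes_per_flow[flow])

    def report_to_controller(flow, nbytes):
        # Stand-in for the millisecond-scale report to the controller.
        print(f"event: flow {flow} passed {nbytes} bytes within one window")

    # Example: flag any flow to port 80 bursting past 128 KB in a 10 ms window.
    trig = Trigger(match=lambda p: p["dport"] == 80, threshold=128 * 1024)
    for _ in range(100):
        trig.on_packet({"flow": "10.0.0.1->10.0.0.2:80", "dport": 80, "len": 1500})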

Source: http://www.cs.yale.edu/homes/yu-minlan/writeup/sigcomm16.pdf

PathNet: Evolution Channels Gradient Descent in Super Neural Networks

For artificial general intelligence (AGI) it would be efficient
if multiple users trained the same giant neural network, permitting
parameter reuse, without catastrophic forgetting.
PathNet is a first step in this direction. It is a neural network
algorithm that uses agents embedded in the neural network
whose task is to discover which parts of the network to
re-use for new tasks. Agents are pathways (views) through
the network which determine the subset of parameters that
are used and updated by the forwards and backwards passes
of the backpropagation algorithm. During learning, a tournament
selection genetic algorithm is used to select pathways
through the neural network for replication and mutation.
Pathway fitness is the performance of that pathway
measured according to a cost function. We demonstrate
successful transfer learning: fixing the parameters along a
path learned on task A and re-evolving a new population
of paths for task B allows task B to be learned faster than
it could be learned from scratch or after fine-tuning. Paths
evolved on task B re-use parts of the optimal path evolved
on task A. Positive transfer was demonstrated for binary
MNIST, CIFAR, and SVHN supervised learning classification
tasks, and a set of Atari and Labyrinth reinforcement
learning tasks, suggesting PathNets have general applicability
for neural network training. Finally, PathNet also significantly
improves the robustness to hyperparameter choices
of a parallel asynchronous reinforcement learning algorithm
(A3C).
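
A minimal sketch of the tournament-selection loop described above: two pathways are sampled, their fitnesses compared, and the loser is overwritten by a mutated copy of the winner. The pathway encoding, mutation rate, and fitness function below are toy assumptions for illustration, not the paper's settings.

    import random

    # Toy tournament selection over pathways. A pathway (genotype) picks,
    # for each of L layers, a small subset of that layer's modules; real
    # fitness would come from training and evaluating the network
    # restricted to that pathway.
    LAYERS, MODULES_PER_LAYER, ACTIVE_PER_LAYER = 3, 10, 3

    def random_pathway():
        return [random.sample(range(MODULES_PER_LAYER), ACTIVE_PER_LAYER)
                for _ in range(LAYERS)]

    def mutate(pathway, rate=0.1):
        # Independently re-draw each active module with small probability.
        return [[random.randrange(MODULES_PER_LAYER) if random.random() < rate else m
                 for m in layer] for layer in pathway]

    def fitness(pathway):
        # Stand-in for "train the pathway's parameters, measure performance".
        return -sum(abs(m - MODULES_PER_LAYER // 2)
                    for layer in pathway for m in layer)

    population = [random_pathway() for _ in range(64)]
    for step in range(200):
        a, b = random.sample(range(len(population)), 2)
        winner, loser = (a, b) if fitness(population[a]) >= fitness(population[b]) else (b, a)
        # The loser's genotype is overwritten by a mutated copy of the winner's.
        population[loser] = mutate(population[winner])

    print("best pathway:", max(population, key=fitness))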

Source: https://arxiv.org/pdf/1701.08734.pdf

The Abuse Sharing Economy: Understanding the Limits of Threat Exchanges

The underground commoditization of compromised hosts suggests a tacit capability where miscreants leverage the same machine (subscribed by multiple criminal ventures) to simultaneously profit from spam, fake account registration, malicious hosting, and other forms of automated abuse. To expedite the detection of these commonly abusive hosts, there are now multiple industry-wide efforts that aggregate abuse reports into centralized threat exchanges. In this work, we investigate the potential benefit of global reputation tracking and the pitfalls therein. We develop our findings from a snapshot of 45 million IP addresses abusing six Google services, including Gmail, YouTube, and ReCaptcha, between April 7 and April 21, 2015. We estimate the scale of end hosts controlled by attackers, expose underground biases that skew the abuse perspectives of individual web services, and examine the frequency with which criminals re-use the same infrastructure to attack multiple, heterogeneous services. Our results indicate that an average Google service can block 14% of abusive traffic based on threats aggregated from seemingly unrelated services, though we demonstrate that outright blacklisting incurs an untenable volume of false positives.
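
The cross-service benefit quantified above can be sketched as a simple set computation: for each service, what fraction of its abusive IPs has at least one other service also reported? The data below is invented for illustration; the paper's measurement covers 45 million real addresses.

    # Toy model of a threat exchange: each service reports the IPs it saw
    # abusing it; for each service we ask what share of its abusive IPs
    # some *other* service had also reported. All values are made up.
    reports = {
        "gmail":     {"1.2.3.4", "5.6.7.8", "9.9.9.9"},
        "youtube":   {"5.6.7.8", "8.8.4.4"},
        "recaptcha": {"1.2.3.4", "7.7.7.7", "8.8.4.4"},
    }

    for service, ips in reports.items():
        others = set().union(*(v for k, v in reports.items() if k != service))
        shared = ips & others
        print(f"{service}: {len(shared)}/{len(ips)} "
              f"({100 * len(shared) / len(ips):.0f}%) also seen elsewhere")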

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45491.pdf

TF.Learn: TensorFlow’s High-level Module for Distributed Machine Learning

TF.Learn is a high-level Python module for distributed machine learning inside TensorFlow
(Abadi et al., 2015). It provides an easy-to-use Scikit-learn (Pedregosa et al., 2011)
style interface to simplify the process of creating, configuring, training, evaluating, and
experimenting with a machine learning model. TF.Learn integrates a wide range of state-of-the-art
machine learning algorithms built on top of TensorFlow's low-level APIs for small- to
large-scale supervised and unsupervised problems. This module focuses on bringing machine
learning to non-specialists using a general-purpose high-level language as well as
researchers who want to implement, benchmark, and compare their new methods in a
structured environment. Emphasis is put on ease of use, performance, documentation, and
API consistency.
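
The Scikit-learn style the abstract refers to is the fit/predict/evaluate estimator pattern. The self-contained mock below illustrates only the shape of that workflow; it is not TF.Learn's actual class or API.

    from collections import Counter

    # Mock estimator showing the Scikit-learn-style interface TF.Learn
    # emulates (fit / predict / evaluate). It "learns" only the majority
    # class, purely to keep the example runnable and self-contained.
    class MajorityClassifier:
        def fit(self, x, y):
            self.majority_ = Counter(y).most_common(1)[0][0]
            return self                      # sklearn-style chaining

        def predict(self, x):
            return [self.majority_ for _ in x]

        def evaluate(self, x, y):
            preds = self.predict(x)
            return {"accuracy": sum(p == t for p, t in zip(preds, y)) / len(y)}

    clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
    print(clf.evaluate([[3], [4]], [1, 0]))   # {'accuracy': 0.5}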

Source: https://arxiv.org/pdf/1612.04251v1.pdf

Slicer: Auto-Sharding for Datacenter Applications

Sharding is a fundamental building block of large-scale
applications, but most applications rely on their own custom,
ad-hoc implementations. Our goal is to make sharding as easily
reusable as a filesystem or lock manager. Slicer is
Google’s general-purpose sharding service. It monitors
signals such as load hotspots and server health to dynamically
shard work over a set of servers. Its goals are
to maintain high availability and reduce load imbalance
while minimizing churn from moved work.

In this paper, we describe Slicer’s design and implementation.
Slicer has the consistency and global optimization
of a centralized sharder while approaching the
high availability, scalability, and low latency of systems
that make local decisions. It achieves this by separating
concerns: a reliable data plane forwards requests, and a
smart control plane makes load-balancing decisions off
the critical path. Slicer’s small but powerful API has
proven useful and easy to adopt in dozens of Google applications.
It is used to allocate resources for web service
front-ends, coalesce writes to increase storage bandwidth,
and increase the efficiency of a web cache. It
currently handles 2-7M req/s of production traffic. The
median production Slicer-managed workload uses 63%
fewer resources than it would with static sharding.
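
A toy sketch of the separation of concerns described above: the control plane partitions a hash space into contiguous ranges ("slices") and assigns each to a server, while the data plane routes every request with a cheap lookup. The hash choice, range splits, and names below are assumptions for illustration, not Slicer's implementation.

    import bisect
    import hashlib

    HASH_SPACE = 2 ** 32

    def key_hash(key: str) -> int:
        # Map an application key into the hash space (illustrative choice).
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

    class SliceMap:
        """Data-plane side: a sorted list of (range_start, server) pairs."""
        def __init__(self, assignments):
            self.starts = [start for start, _ in assignments]
            self.servers = [server for _, server in assignments]

        def lookup(self, key: str) -> str:
            # Binary-search for the slice whose range contains the key's hash.
            i = bisect.bisect_right(self.starts, key_hash(key)) - 1
            return self.servers[i]

    # The control plane decides the split (e.g., to shed a load hotspot);
    # the data plane just swaps in the new map and keeps forwarding.
    slice_map = SliceMap([(0, "server-a"),
                          (HASH_SPACE // 2, "server-b"),
                          (3 * HASH_SPACE // 4, "server-c")])
    print(slice_map.lookup("user:42"), slice_map.lookup("user:43"))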

Source: https://www.usenix.org/system/files/conference/osdi16/osdi16-adya.pdf