Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research

Given the recent explosion of interest in streaming data and online algorithms, clustering of time series
subsequences, extracted via a sliding window, has received much attention. In this work we make a
surprising claim. Clustering of time series subsequences is meaningless. More concretely, clusters extracted
from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by
any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random.
While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has
never appeared in the literature. We can justify calling our claim surprising, since it invalidates the
contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative
examples, and a comprehensive set of experiments on reimplementations of previous work. Although the
primary contribution of our work is to draw attention to the fact that an apparent solution to an important
problem is incorrect and should no longer be used, we also introduce a novel method which, based on the
concept of time series motifs, is able to meaningfully cluster subsequences on some time series datasets.

Source: http://www.cs.ucr.edu/~eamonn/meaningless.pdf
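
To make the setup concrete, here is a minimal sketch of the procedure the paper critiques: slide a window along a series, collect every subsequence, and hand them to k-means. The series, window length, and cluster count below are arbitrary placeholders, not anything taken from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def sliding_window_subsequences(series, w):
        """Extract all length-w subsequences of a 1-D series via a sliding window."""
        return np.array([series[i:i + w] for i in range(len(series) - w + 1)])

    # Hypothetical example: a noisy sine wave stands in for any dataset.
    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)

    subsequences = sliding_window_subsequences(series, w=32)

    # Cluster the subsequences; the paper's claim is that the resulting cluster
    # centers are essentially independent of the input data, so they look much
    # the same (smooth, sine-like shapes) no matter which series you start from.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(subsequences)
    print(kmeans.cluster_centers_.shape)  # (4, 32)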

DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning

Anomaly detection is a critical step towards building a secure and trustworthy system.
The primary purpose of a system log is to record system states and significant events
at various critical points to help debug system failures and perform root cause analysis.
Such log data is universally available in nearly all computer systems. Log data is an
important and valuable resource for understanding system status and performance issues;
therefore, the various system logs are naturally an excellent source of information for
online monitoring and anomaly detection. We propose DeepLog, a deep neural network model
utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language
sequence. This allows DeepLog to automatically learn log patterns from normal execution,
and detect anomalies when log patterns deviate from the model trained from log data under
normal execution. In addition, we demonstrate how to incrementally update the DeepLog
model in an online fashion so that it can adapt to new log patterns over time. Furthermore,
DeepLog constructs workflows from the underlying system log so that once an anomaly is
detected, users can diagnose the detected anomaly and perform root cause analysis
effectively. Extensive experimental evaluations over large log data have shown that
DeepLog has outperformed other existing log-based anomaly detection methods based on
traditional data mining methodologies.

Source: https://acmccs.github.io/papers/p1285-duA.pdf
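
The core idea, as the abstract describes it, is to treat the stream of log keys like words in a sentence: an LSTM is trained to predict the next key from a window of recent keys, and an entry is flagged as anomalous when the observed key falls outside the model's top-k predictions. A rough PyTorch sketch of that idea (layer sizes, window length, and key counts are made up here; this is not the authors' implementation):

    import torch
    import torch.nn as nn

    class NextLogKeyLSTM(nn.Module):
        """Predict the next log key from a window of previous keys (DeepLog-style sketch)."""
        def __init__(self, num_keys, embed_dim=32, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(num_keys, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_keys)

        def forward(self, windows):            # windows: (batch, window_len) of key ids
            h, _ = self.lstm(self.embed(windows))
            return self.out(h[:, -1, :])       # logits over the next key

    def is_anomalous(model, window, actual_next_key, top_k=5):
        """Flag the entry as anomalous if the observed key is outside the top-k predictions."""
        with torch.no_grad():
            logits = model(window.unsqueeze(0))
            top = torch.topk(logits, top_k, dim=-1).indices.squeeze(0)
        return actual_next_key not in top.tolist()

    # Hypothetical usage: 50 distinct log key types, windows of 10 previous keys.
    # In practice the model would first be trained with cross-entropy on normal logs.
    model = NextLogKeyLSTM(num_keys=50)
    window = torch.randint(0, 50, (10,))
    print(is_anomalous(model, window, actual_next_key=3))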

PROCHLO: Strong Privacy for Analytics in the Crowd

The large-scale monitoring of computer users’ software
activities has become commonplace, e.g., for application
telemetry, error reporting, or demographic profiling. This
paper describes a principled systems architecture—Encode,
Shuffle, Analyze (ESA)—for performing such monitoring
with high utility while also protecting user privacy. The ESA
design, and its PROCHLO implementation, are informed by
our practical experiences with an existing, large deployment
of privacy-preserving software monitoring.
With ESA, the privacy of monitored users’ data is guaranteed
by its processing in a three-step pipeline. First, the data
is encoded to control scope, granularity, and randomness.
Second, the encoded data is collected in batches subject to
a randomized threshold, and blindly shuffled, to break linkability
and to ensure that individual data items get “lost in the
crowd” of the batch. Third, the anonymous, shuffled data is
analyzed by a specific analysis engine that further prevents
statistical inference attacks on analysis results.
ESA extends existing best-practice methods for sensitive-data
analytics, by using cryptography and statistical techniques
to make explicit how data is elided and reduced in
precision, how only common-enough, anonymous data is analyzed,
and how this is done for only specific, permitted purposes.
As a result, ESA remains compatible with the established
workflows of traditional database analysis.
Strong privacy guarantees, including differential privacy,
can be established at each processing step to defend
against malice or compromise at one or more of those steps.
PROCHLO develops new techniques to harden those steps,
including the Stash Shuffle, a novel scalable and efficient
oblivious-shuffling algorithm based on Intel’s SGX, and new
applications of cryptographic secret sharing and blinding.
We describe ESA and PROCHLO, as well as experiments
that validate their ability to balance utility and privacy.

Source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/9161f08531933ca6e401834898b4e33362c55430.pdf
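
As a toy illustration of the Encode, Shuffle, Analyze flow, the sketch below coarsens each report, releases a batch only once it clears a randomized size threshold, shuffles away ordering, and reports only values that are common enough. All thresholds are invented, and the real system's cryptography, SGX-based oblivious shuffling, and differential-privacy mechanisms are omitted entirely.

    import random
    from collections import Counter

    def encode(records, granularity=10):
        """Encode: coarsen each value to limit scope and granularity."""
        return [round(r / granularity) * granularity for r in records]

    def shuffle(encoded, min_batch=100):
        """Shuffle: release a batch only above a randomized threshold, in random order."""
        threshold = min_batch + random.randint(0, 20)   # randomized threshold
        if len(encoded) < threshold:
            return []                                   # too few items to hide in the crowd
        batch = list(encoded)
        random.shuffle(batch)                           # break linkability to arrival order
        return batch

    def analyze(batch, min_count=5):
        """Analyze: report only values common enough to resist statistical inference."""
        counts = Counter(batch)
        return {value: count for value, count in counts.items() if count >= min_count}

    # 500 simulated user reports flow through the three stages.
    reports = [random.gauss(50, 15) for _ in range(500)]
    print(analyze(shuffle(encode(reports))))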

TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow

We introduce TensorFlow Agents, an efficient infrastructure paradigm for
building parallel reinforcement learning algorithms in TensorFlow. We simulate
multiple environments in parallel, and group them to perform the neural
network computation on a batch rather than individual observations. This
allows the TensorFlow execution engine to parallelize computation, without
the need for manual synchronization. Environments are stepped in separate
Python processes to progress them in parallel without interference of the global
interpreter lock. As part of this project, we introduce BatchPPO, an efficient
implementation of the proximal policy optimization algorithm. By open sourcing
TensorFlow Agents, we hope to provide a flexible starting point for future
projects that accelerates future research in the field.

Source: https://arxiv.org/pdf/1709.02878.pdf
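
The batching idea can be sketched without TensorFlow at all: step several environments, stack their observations, and run one policy computation over the whole batch. The toy environment and placeholder policy below are invented for illustration, and unlike TensorFlow Agents the environments here are stepped serially in one process rather than in separate Python processes.

    import numpy as np

    class ToyEnv:
        """Stand-in environment with a 4-dimensional observation (hypothetical)."""
        def reset(self):
            self.state = np.zeros(4)
            return self.state
        def step(self, action):
            self.state = self.state + action            # trivial dynamics
            reward = -np.abs(self.state).sum()
            return self.state, reward, False, {}

    def policy(batch_obs):
        """One computation over a whole batch of observations (placeholder for a neural net)."""
        return np.clip(-batch_obs, -1.0, 1.0)

    envs = [ToyEnv() for _ in range(8)]                  # 8 parallel environments
    batch_obs = np.stack([env.reset() for env in envs])

    for _ in range(100):
        batch_actions = policy(batch_obs)                # single batched forward pass
        results = [env.step(a) for env, a in zip(envs, batch_actions)]
        batch_obs = np.stack([obs for obs, _, _, _ in results])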

Research Blog: Federated Learning: Collaborative Machine Learning without Centralized Training Data

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we’re introducing an additional approach: Federated Learning.

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.

Source: https://research.googleblog.com/2017/04/federated-learning-collaborative.html?m=1
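
The usual building block behind this is federated averaging: the server ships the current model to clients, each client trains locally on its own data, and the server averages the returned weights. A bare-bones numpy sketch with a linear model and synthetic clients (illustrative only, not Google's implementation):

    import numpy as np

    def local_update(weights, X, y, lr=0.1, epochs=5):
        """One client's local training on its own data (linear regression via gradient descent)."""
        w = weights.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    def federated_round(global_w, client_data):
        """Server sends the model out, clients train locally, server averages the results."""
        updates = [local_update(global_w, X, y) for X, y in client_data]
        return np.mean(updates, axis=0)      # raw data never leaves the clients

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    clients = []
    for _ in range(10):
        X = rng.standard_normal((50, 2))
        clients.append((X, X @ true_w + 0.1 * rng.standard_normal(50)))

    w = np.zeros(2)
    for _ in range(20):
        w = federated_round(w, clients)
    print(w)   # approaches true_w without centralizing any client's data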

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC–called a Tensor Processing Unit (TPU)–deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS), and a large (28MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X – 30X faster than its contemporary GPU or CPU with TOPS/Watt about 30X – 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Source: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view
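
For a sense of where the 92 TOPS figure comes from: 65,536 MACs is a 256 by 256 array, each MAC performs a multiply and an accumulate per cycle, and the full paper quotes a clock of roughly 700 MHz (an assumption here, since the clock rate is not stated in the abstract):

    macs = 256 * 256          # 65,536 8-bit MAC units
    ops_per_mac = 2           # one multiply plus one accumulate per cycle
    clock_hz = 700e6          # assumed ~700 MHz TPU clock (from the full paper)
    peak_tops = macs * ops_per_mac * clock_hz / 1e12
    print(peak_tops)          # ~91.8 TeraOps/s, i.e. the quoted ~92 TOPS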

How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms

Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.
This isn’t a trick question. Start with grouping the spiders together.
Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.
That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.
For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.

Source: https://medium.freecodecamp.com/how-machines-make-sense-of-big-data-an-introduction-to-clustering-algorithms-4bd97d4fbaba
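
The counts quoted here are Bell numbers, which count the ways to partition a set of n items into any number of non-empty groups; both figures can be checked with the Bell triangle recurrence:

    def bell(n):
        """Number of ways to partition a set of n items (Bell triangle recurrence)."""
        row = [1]                                # row 0 of the Bell triangle
        for _ in range(n):
            new_row = [row[-1]]                  # each row starts with the previous row's last entry
            for value in row:
                new_row.append(new_row[-1] + value)
            row = new_row
        return row[0]                            # first entry of row n is the n-th Bell number

    print(bell(10))   # 115975 possible groupings of ten bugs
    print(bell(20))   # 51724158235372, i.e. over fifty trillion for twenty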