Clustering of Time Series Subsequences is meaningless: Implications for Previous and Future Research

Given the recent explosion of interest in streaming data and online algorithms, clustering of time series
subsequences, extracted via a sliding window, has received much attention. In this work we make a
surprising claim. Clustering of time series subsequences is meaningless. More concretely, clusters extracted
from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by
any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random.
While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has
never appeared in the literature. We can justify calling our claim surprising, since it invalidates the
contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative
examples, and a comprehensive set of experiments on reimplementations of previous work. Although the
primary contribution of our work is to draw attention to the fact that an apparent solution to an important
problem is incorrect and should no longer be used, we also introduce a novel method which, based on the
concept of time series motifs, is able to meaningfully cluster subsequences on some time series datasets


Causality in machine learning


Given recent advances and interest in machine learning, those of us with traditional statistical training have had occasion to ponder the similarities and differences between the fields. Many of the distinctions are due to culture and tools, but there are also differences in thinking which run deeper. Take, for instance, how each field views the provenance of the training data when building predictive models. For most of ML, the training data is a given, often presumed to be representative of the data against which the prediction model will be deployed, but not much else. With a few notable exceptions, ML abstracts away from the data generating mechanism, and hence sees the data as raw material from which predictions are to be extracted. Indeed, machine learning generally lacks the vocabulary to capture the distinction between observational data and randomized data that statistics finds crucial. To contrast machine learning with statistics is not the object of this post (we can do such a post if there is sufficient interest). Rather, the focus of this post is on combining observational data with randomized data in model training, especially in a machine learning setting. The method we describe is applicable to prediction systems employed to make decisions when choosing between uncertain alternatives.


Practical advice for analysis of large, complex data sets

For a number of years, I led the data science team for Google Search logs. We were often asked to make sense of confusing results, measure new phenomena from logged behavior, validate analyses done by others, and interpret metrics of user behavior. Some people seemed to be naturally good at doing this kind of high quality data analysis. These engineers and analysts were often described as “careful” and “methodical”. But what do those adjectives actually mean? What actions earn you these labels?

To answer those questions, I put together a document shared Google-wide which I optimistically and simply titled “Good Data Analysis.” To my surprise, this document has been read more than anything else I’ve done at Google over the last eleven years. Even four years after the last major update, I find that there are multiple Googlers with the document open any time I check.

Why has this document resonated with so many people over time? I think the main reason is that it’s full of specific actions to take, not just abstract ideals. I’ve seen many engineers and analysts pick up these habits and do high quality work with them. I’d like to share the contents of that document in this blog post.

The advice is organized into three general areas:
Technical: Ideas and techniques for how to manipulate and examine your data.
Process: Recommendations on how you approach your data, what questions to ask, and what things to check.
Social: How to work with others and communicate about your data and insights.


No shard left behind: APIs for massive parallel efficiency

Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.

Apache Beam: A unified and open model for batch and streaming data processing

Frances Perry from Google spoke about Apache Beam. Through deft animations, she showed attendees how the seemingly hard problem of managing batch and streaming data sets within a common framework and system can be solved with a unified API. She framed the problem around a set of constraints and requirements on latency, completeness, and cost. This system handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments.

Bayes and Big Data: The Consensus Monte Carlo Algorithm

A useful definition of “big data” is data that is too big to comfortably process on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).


M3A: Model, MetaModel, and Anomaly Detection in Web Searches

‘Alice’ is submitting one web search per five minutes, for three hours in a row−is it normal? How to detect abnormal search behaviors, among Alice and other users? Is there any distinct pattern in Alice’s (or other users’) search behavior? We studied what is probably the largest, publicly available, query log, containing more than 30 million queries from 0.6 million users. In this paper, we present a novel, user-and group-level framework, M3A: Model, MetaModel and Anomaly detection. For each user, we discover and explain a surprising, bi-modal pattern of the inter-arrival time (IAT) of landed queries (queries with user click-through). Specifically, the model Camel-Log is proposed to describe such an IAT distribution; we then notice the correlations among its parameters at the group level. Thus, we further propose the metamodel Meta-Click, to capture and explain the two-dimensional, heavy-tail distribution of the parameters. Combining Camel-Log and Meta-Click, the proposed M3A has the following strong points: (1) the accurate modeling of marginal IAT distribution, (2) quantitative interpretations, and (3) anomaly detection.


Incremental, Iterative Data Processing with Timely Dataflow

We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchronous execution. We describe two of the programming frameworks built on Naiad: GraphLINQ for parallel graph processing, and differential dataflow for nested iterative and incremental computations. We show that a general-purpose system can achieve performance that matches, and sometimes exceeds, that of specialized systems.


Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Mesa is a highly scalable analytic data warehousing system
that stores critical measurement data related to Google’s
Internet advertising business. Mesa is designed to satisfy
a complex and challenging set of user and systems requirements,
including near real-time data ingestion and queryability,
as well as high availability, reliability, fault tolerance,
and scalability for large data and query volumes. Specifi-
cally, Mesa handles petabytes of data, processes millions of
row updates per second, and serves billions of queries that
fetch trillions of rows per day. Mesa is geo-replicated across
multiple datacenters and provides consistent and repeatable
query answers at low latency, even when an entire datacenter
fails. This paper presents the Mesa system and reports
the performance and scale that it achieves.


Data Monitoring at 250 Gbit/s with Facebook – YouTube

Inside Facebook, the team has always provided monitoring as a service. This allows them to keep application monitoring both approachable and powerful to serve use cases of different complexity. They enable realtime analysis, regressions and anomaly detection, as well as root-causing site-level issues to specific applications and nodes causing them within minutes. Being a radar and powering automations for Facebook Infrastructure is a big scalability challenge. Learn how Facebook scaled its real-time monitoring system 20x and now peaking at 250 Gbit/s ingestion rate. They’ll dive into the monitoring system’s architecture evolution and some of the problems they faced along the way. They’ll also discuss current challenges, including anomaly detection at scale, driving data exploration, and intelligent spam fighting.