Anomaly detection can automate rule-based abnormality checks and generate alarms directly from time series data.
Even though Borgmon remains internal to Google, the idea of treating time-series data as a data source for generating alerts is now accessible to everyone through those open source tools like Prometheus [...]
- Google SRE Book
Anomaly detection entails the discovery of abnormal patterns in a dataset. For a large-scale dataset, the pattern is multivariate and captured by different data types such as metrics, logs, and traces. We are a metrics-first anomaly detection provider because metrics capture long-term patterns in the data. The challenge of large-scale anomaly detection lies in data variety and volatility. Here is an example from a Portworx dataset that we have worked on: about 1,500 different metrics coming from 7 different source types, including Cluster, Pool, Disk, Volume, Network, Proc, and Node. The following figures show data coming from 4 different layers: Volume, Pool, Cluster, and Proc. Clearly, different source types have different non-linear variability patterns and complex inter-dependencies, which are difficult to capture with simple multivariate statistical analysis. We use non-linear time series analysis to capture these complex interactions.
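To illustrate why simple multivariate statistics fall short, here is a minimal sketch (with synthetic data, not our actual algorithm) showing how Pearson correlation can report near-zero dependence between two metrics that are in fact deterministically related through a non-linear function, while a histogram-based mutual information estimate detects the relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)                 # e.g. a normalized load metric
y = x ** 2 + 0.01 * rng.normal(size=5000)    # non-linearly dependent metric

# Linear statistic: close to zero despite a strong dependence.
pearson = np.corrcoef(x, y)[0, 1]

def mutual_info(a, b, bins=16):
    """Histogram-based mutual information estimate (in nats)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

# Non-linear statistic: clearly positive, exposing the dependence.
mi = mutual_info(x, y)
```

This is why a single correlation matrix over heterogeneous layers can miss the very inter-dependencies that matter for alerting.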
Here is an example of our anomaly detection results, capturing anomalies across different operating layers:
Most likely, you would want an alarm where all four time series metrics spike together. All other local anomalies are minor deviations from normal behavior that can likely be ignored. Our learning algorithm, over time, recognizes which anomalies are alarm-worthy and which are not.
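A minimal sketch of this idea (illustrative only, not our learning algorithm): detect spikes per series with a robust deviation test, then raise an alarm only where spikes coincide across several series, treating isolated spikes as ignorable local noise:

```python
import numpy as np

def spike_mask(series, z=3.5):
    """Flag points more than z robust (MAD-based) deviations from the median."""
    med = np.median(series)
    mad = np.median(np.abs(series - med)) + 1e-9
    return np.abs(series - med) / (1.4826 * mad) > z

def alarm_points(metrics, min_coincident=3):
    """Alarm only where spikes coincide across several series; a lone
    spike in a single series is treated as a minor local deviation."""
    masks = np.stack([spike_mask(m) for m in metrics])
    return np.where(masks.sum(axis=0) >= min_coincident)[0]

# Illustrative data: four series share a spike at t=50; series 0 also has a
# lone spike at t=20 that should not trigger an alarm on its own.
rng = np.random.default_rng(1)
series = [rng.normal(0, 1, 100) for _ in range(4)]
for s in series:
    s[50] += 12.0
series[0][20] += 12.0
```

Calling `alarm_points(series)` here flags only the coincident spike, not the isolated one.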
In this context, another important feature is runtime user feedback. Essentially, runtime user feedback allows users to pass in dynamic execution context that tunes the precision and recall of our anomaly classification algorithm.
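One simple way such a feedback loop can work (a hypothetical sketch, not our actual classifier) is to let feedback nudge the alarm threshold: "false positive" feedback raises the bar to improve precision, while "missed anomaly" feedback lowers it to improve recall:

```python
class FeedbackThreshold:
    """Sketch: adapt an anomaly-score alarm threshold from runtime user
    feedback. The step size and starting threshold are illustrative."""

    def __init__(self, threshold=0.9, step=0.05):
        self.threshold = threshold
        self.step = step

    def is_alarm(self, score):
        return score >= self.threshold

    def feedback(self, kind):
        if kind == "false_positive":
            # User said an alarm was noise: demand a higher score (precision).
            self.threshold = min(1.0, self.threshold + self.step)
        elif kind == "missed_anomaly":
            # User reported a missed event: lower the bar (recall).
            self.threshold = max(0.0, self.threshold - self.step)
```

Over many feedback events, the threshold settles where the user's tolerance for noise and misses balances out.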
Although we are focusing on time series data now, we will add log anomaly detection in the future. We will also offer distributed tracing monitoring so that users can quickly investigate alarms.
A scale-out system often includes a large number of interrelated time series. Recognizing which time series is most influential within a cohort uncovers significant insight about the relative relevance of an alarm and helps with causality determination. We model this as a graph centrality learning problem, where each node of the graph represents a time series.
Our graph learning algorithm computes a relevancy score for each time series and ranks the series by relevance.
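To make the centrality idea concrete, here is a minimal sketch (illustrative assumptions, not our production algorithm): build a similarity graph from absolute pairwise correlations, then rank series by eigenvector centrality computed with power iteration:

```python
import numpy as np

def rank_by_centrality(series_matrix, iters=100):
    """Rank time series (rows) by eigenvector centrality of the graph
    whose edge weights are absolute pairwise correlations."""
    adj = np.abs(np.corrcoef(series_matrix))
    np.fill_diagonal(adj, 0.0)           # no self-loops
    c = np.ones(adj.shape[0])
    for _ in range(iters):               # power iteration
        c = adj @ c
        c /= np.linalg.norm(c)
    return np.argsort(-c)                # most influential series first

# Illustrative cohort: row 0 is a "hub" metric that three others follow
# with noise; two more are unrelated.
rng = np.random.default_rng(2)
hub = rng.normal(size=500)
followers = [hub + 0.5 * rng.normal(size=500) for _ in range(3)]
unrelated = [rng.normal(size=500) for _ in range(2)]
ranking = rank_by_centrality(np.vstack([hub] + followers + unrelated))
```

In this toy cohort, the hub series ends up ranked first, which is exactly the signal one wants when deciding which alarm in a correlated group deserves attention.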
For large-scale system analysis, time series similarity pair ranking is a critical task, especially for event correlation and root cause analysis. Standard correlation detection, however, takes a long time for large-scale systems with heterogeneous time series data. We solve this problem with our robust time series correlation detection. Here is an example of how we capture a large number of similarity pair rankings for time series of dissimilar shapes and sizes:
We have validated our similarity pair ranking in a Nutanix storage cluster with the following configuration:
Our algorithm can automatically figure out the inherent data model and assist in causality determination through automated time series similarity pair ranking. Dissimilar shapes and sizes of time series data make traditional correlation detection algorithms ineffective; our algorithm tackles this problem.
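One way to sketch robust pair ranking (a toy stand-in for our detector, with hypothetical metric names): rank all metric pairs by Spearman rank correlation, which is insensitive to the differing scales and monotone shape distortions that make plain Pearson correlation unreliable across heterogeneous layers:

```python
import numpy as np
from itertools import combinations

def rank_similar_pairs(series, top_k=3):
    """Rank metric pairs by absolute Spearman rank correlation."""
    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    ranked = {name: ranks(v) for name, v in series.items()}
    scored = [(abs(np.corrcoef(ranked[a], ranked[b])[0, 1]), a, b)
              for a, b in combinations(series, 2)]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical metrics: iops tracks latency through a non-linear monotone
# relationship and lives on a completely different scale.
rng = np.random.default_rng(3)
latency_ms = rng.normal(5.0, 1.0, 300)
metrics = {
    "volume.latency_ms": latency_ms,
    "pool.iops": 1e4 * np.exp(-0.5 * latency_ms),  # dissimilar shape/size
    "node.temp_c": rng.normal(40.0, 2.0, 300),     # unrelated
}
top = rank_similar_pairs(metrics)
```

Despite the wildly different shapes and scales, the latency/iops pair comes out on top, while the unrelated temperature metric ranks low.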
A large-scale system analysis requires a multi-perspective view. That is why we are bringing log analytics into our platform, with the goal of reducing the manual burden of log file parsing. We provide an augmented view of log files, clustering log lines into red, yellow, and green groups based on their similarity scores.
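As a rough sketch of similarity-based grouping (the thresholds and templates here are illustrative, not our scoring model): bucket each log line by its best similarity score against known-healthy template lines, so routine lines land in green and lines unlike anything routine land in red:

```python
import difflib

def severity_group(line, healthy_templates, red=0.4, yellow=0.7):
    """Bucket a log line by its best similarity against healthy templates:
    red    = unlike anything routine (likely anomalous)
    yellow = partial match (worth a look)
    green  = routine."""
    best = max(difflib.SequenceMatcher(None, line, t).ratio()
               for t in healthy_templates)
    if best < red:
        return "red"
    if best < yellow:
        return "yellow"
    return "green"

templates = ["INFO request served in 12ms", "INFO health check passed"]
```

A near-duplicate of a healthy line ("INFO request served in 15ms") falls into green, while a line with no routine counterpart falls into red, letting an operator scan straight to the red group.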