Two-Tiered AI Engine

AdeptDC uses AI to uncover hidden insights in streaming log and metrics data. The core prediction tasks we perform are multivariate anomaly detection, alarm ranking, and similarity pair ranking. For most of these tasks we use a two-tiered AI engine that combines deep transfer learning with online learning.

Our first intelligent byte comes within 1 hour of deployment. As with any AI engine, our prediction fidelity improves as more data flows in.

Data variety and volatility are the two most critical challenges we handle in our AI deployment.

Two critical design principles for our AI engine are: high precision and high explainability.

  • High Precision: High precision means that most of the alarms we raise are real, so false alarms are kept to a minimum.

  • High Explainability: High explainability means a human operator can easily interpret the predictions from the AI engine.
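As a point of reference, precision can be computed from a set of raised alarms and the incidents that actually occurred. The sketch below is only an illustration with made-up data; the function name and the sample lists are hypothetical and not part of the AdeptDC product.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # alarms that were real
    fp = sum(1 for p, a in pairs if p and not a)    # false alarms
    fn = sum(1 for p, a in pairs if a and not p)    # missed incidents
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 4 alarms raised, 3 of them real, 1 real incident missed.
predicted = [True, True, True, True, False, False]
actual = [True, True, True, False, True, False]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

High precision corresponds to a small false-alarm count relative to the number of alarms raised.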

AI Conceptual Diagram

AI Stack

Anomaly Detection

Anomaly detection can potentially automate rule-based abnormality detection and directly create alarms from time series data.

Even though Borgmon remains internal to Google, the idea of treating time-series data as a data source for generating alerts is now accessible to everyone through those open source tools like Prometheus [...]

-Google SRE Handbook

Anomaly detection entails discovering abnormal patterns in a dataset. For a large-scale dataset, the patterns are multivariate and are captured by different data types such as metrics, logs, and traces. We are a metrics-first anomaly detection provider because metrics can capture long-term patterns in the data. The challenge of large-scale anomaly detection lies in data variety and volatility.

Here is an example from a Portworx dataset that we have worked on. We dealt with about 1500 different metrics coming from 7 different source types: Cluster, Pool, Disk, Volume, Network, Proc, and Node. The following figures show data coming from 4 different layers: Volume, Pool, Cluster, and Proc. Clearly, different source types have different non-linear variability patterns and complex inter-dependencies. These are difficult to capture with simple multivariate statistical analysis, so we use non-linear time analysis to capture the complex interactions.

Here is an example of our anomaly detection result that captures the anomalies across different operating layers:

An alarm is most likely where all four time series metrics spike together. All other local anomalies are minor deviations from normal behavior and can possibly be ignored. Over time, our learning algorithm recognizes which anomalies are alarm-worthy and which are not.
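The difference between a local anomaly and an alarm-worthy one can be pictured with a deliberately simple stand-in: flag a time index only when every metric spikes at that index. This is a z-score sketch on synthetic data, not the learning algorithm described above; the metric names, the threshold, and the helper functions are assumptions made for illustration.

```python
from statistics import mean, pstdev


def spike_mask(series, threshold=3.0):
    """Mark points whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [False] * len(series)
    return [abs(x - mu) / sigma > threshold for x in series]


def joint_alarms(metrics, threshold=3.0):
    """Return the time indices where every metric spikes at once."""
    masks = [spike_mask(s, threshold) for s in metrics.values()]
    return [t for t, flags in enumerate(zip(*masks)) if all(flags)]


def flat_with_spikes(length, spike_at):
    """Synthetic series: baseline 1.0 with spikes of 10.0 at the given indices."""
    return [10.0 if t in spike_at else 1.0 for t in range(length)]


# Hypothetical layers: "volume" has an extra local spike at t=5.
metrics = {
    "volume": flat_with_spikes(30, {5, 20}),
    "pool": flat_with_spikes(30, {20}),
    "cluster": flat_with_spikes(30, {20}),
    "proc": flat_with_spikes(30, {20}),
}
alarms = joint_alarms(metrics)
print(alarms)
```

The local spike in the hypothetical "volume" series is detected by `spike_mask` but does not become an alarm, because the other layers are quiet at that index.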

In this context, another important feature is runtime user feedback. Essentially, runtime user feedback lets a user pass in dynamic execution context regarding the precision and recall of our anomaly classification algorithm.

A user can simply click on the anomalous point to mark it as a false positive. Our AI engine will automatically pick it up and adjust the baseline.
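How such feedback could move a baseline is sketched below under a strong simplifying assumption: the detector is reduced to a single score threshold, and a false-positive click raises that threshold just past the rejected score. The class, its method names, and the `step` parameter are hypothetical; the actual adjustment logic is not described in this document.

```python
class FeedbackThreshold:
    """Toy detector whose threshold moves when a user rejects an alarm."""

    def __init__(self, threshold=3.0, step=0.1):
        self.threshold = threshold  # scores above this count as anomalies
        self.step = step            # margin added past a rejected score

    def is_anomaly(self, score):
        return score > self.threshold

    def mark_false_positive(self, score):
        """Raise the threshold so that this score no longer fires."""
        if score > self.threshold:
            self.threshold = score + self.step


detector = FeedbackThreshold(threshold=3.0)
before = detector.is_anomaly(3.5)   # fires with the initial threshold
detector.mark_false_positive(3.5)   # user clicks the point
after = detector.is_anomaly(3.5)    # same score is now inside the baseline
print(before, after, detector.is_anomaly(5.0))
```

A larger deviation still fires after the adjustment, so the feedback trades a little recall for precision rather than silencing the detector.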

In addition to anomaly detection, our solution forecasts the time of failure, which helps triangulate failure risks with a failure time window.

In the following chart, the anomaly is triggered at the boundary between the green band and the yellow band. The failure happens when the yellow band turns into the red band. With the yellow band, an SRE can quickly recognize how much time they have before a risk turns into a failure.
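The width of the yellow band is, in effect, an estimate of the time remaining before failure. One very simple way to produce such an estimate is to fit a straight line to recent samples and extrapolate to a failure level. The sketch below assumes a metric that rises toward a known failure level; the function name, the sample values, and the linear trend are all illustrative assumptions, not the forecasting method used in the product.

```python
def time_to_threshold(times, values, failure_level):
    """Least-squares line fit; return time left until failure_level is reached.

    Returns None when the trend is flat or moving away from the level.
    Assumes the metric fails by rising above failure_level.
    """
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        return None
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = v_mean - slope * t_mean
    t_fail = (failure_level - intercept) / slope
    return max(0.0, t_fail - times[-1])


# Hypothetical disk utilization (%) sampled every 10 minutes.
minutes = [0, 10, 20, 30]
utilization = [70.0, 72.0, 74.0, 76.0]
remaining = time_to_threshold(minutes, utilization, failure_level=90.0)
print(remaining)  # minutes left after the last sample
```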

We are also solving the AI cold start problem. Typical time series anomaly detection solutions often take 2-4 weeks to deliver the first intelligent byte. We use an online learning algorithm to deliver time series anomalies within 1 hour of deployment.

Although we are focusing on time series data now, we will focus on log anomaly detection in the future. We will also offer distributed tracing monitoring so that a user can quickly investigate alarms.
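Online learning avoids a long training window because the model is updated one sample at a time. As a minimal illustration of that idea, the sketch below keeps a running mean and variance with Welford's algorithm and starts scoring after a short warm-up. It is a stand-in, not the algorithm used in the engine; the class name, the warm-up length, and the threshold are assumptions.

```python
import math


class OnlineBaseline:
    """Running mean/variance (Welford) with a z-score check per sample."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup  # samples to observe before scoring starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # sum of squared deviations from the mean

    def update(self, x):
        """Score x against the statistics seen so far, then fold it in."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        # Welford update: constant memory, no stored history needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous


# Hypothetical stream: a steady metric followed by one large jump.
stream = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0]
baseline = OnlineBaseline()
flags = [baseline.update(x) for x in stream]
print(flags.index(True))
```

Because each update uses only the running statistics, the detector is usable as soon as the warm-up samples have arrived.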

Time Series Ranking and Graph Centrality

A scale-out system often includes a large number of interrelated time series. Recognizing which time series is most influential within a cohort yields significant insight into the relative relevance of an alarm and helps with causality determination. We model this as a graph centrality learning problem in which each node of the graph represents a time series.

Our graph learning algorithm computes a relevance score for each time series and ranks the time series by that score.
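To make the graph formulation concrete, the sketch below builds a graph whose edge weights are absolute Pearson correlations between series and ranks the nodes by eigenvector centrality, computed with power iteration. Both choices are assumptions made for illustration (the document does not say how edges or scores are defined), and the series names and values are synthetic.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def centrality_ranking(series, iterations=100):
    """Rank series by eigenvector centrality of the |correlation| graph."""
    names = list(series)
    weight = {
        a: {b: abs(pearson(series[a], series[b])) for b in names if b != a}
        for a in names
    }
    score = {name: 1.0 for name in names}
    for _ in range(iterations):
        # Adding the node's own score (A + I) keeps the eigenvectors and
        # prevents the iteration from oscillating.
        new = {
            a: score[a] + sum(w * score[b] for b, w in weight[a].items())
            for a in names
        }
        norm = math.sqrt(sum(v * v for v in new.values())) or 1.0
        score = {a: v / norm for a, v in new.items()}
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical cohort: three latency series move together, one does not.
cohort = {
    "volume_latency": [1, 2, 3, 4, 5, 6, 7, 8],
    "pool_latency": [2, 4, 6, 8, 10, 12, 14, 16],
    "cluster_latency": [1, 2, 3, 4, 5, 6, 7, 9],
    "proc_cpu": [1, -1, 1, -1, 1, -1, 1, -1],
}
ranking = centrality_ranking(cohort)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

In this synthetic cohort the weakly connected series ends up at the bottom of the ranking, which is the behavior one would want from a relevance score.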

Time Series Similarity Pair Ranking and Graph Community Structure Detection

For large-scale system analyses, time series similarity pair ranking is a critical task, especially for event correlation and root cause analysis. Standard correlation detection, however, takes a long time on a large-scale system with heterogeneous time series data. We solve this problem with our robust time series correlation detection. Here is an example of how we rank a large number of similarity pairs across time series of dissimilar shapes and sizes.

We have validated our similarity pair ranking in a Nutanix storage cluster with the following configuration:

Our algorithm can automatically figure out the inherent data model and assist in causality determination through automated time series similarity pair ranking. The dissimilar shapes and sizes of the time series make traditional correlation detection algorithms ineffective. Our algorithm tackles this problem.

Not only do we perform correlations across heterogeneous time series data, but we also provide a standardized ranking score with which an SRE can quickly assess the relative correlation strengths.
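A standardized score for series of different lengths and scales can be illustrated with a simple baseline: resample every series to a common length, z-normalize it, and rank pairs by absolute correlation, which always falls between 0 and 1. This is a plain stand-in for the robust correlation detection mentioned above, not a description of it; the function names, the resampling length, and the sample series are assumptions.

```python
import math
from itertools import combinations


def resample(series, length):
    """Linearly interpolate a series onto `length` (>= 2) evenly spaced points."""
    if len(series) == 1:
        return [float(series[0])] * length
    out = []
    for i in range(length):
        pos = i * (len(series) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def znorm(series):
    """Shift to zero mean and scale to unit variance."""
    n = len(series)
    mu = sum(series) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in series) / n)
    return [(x - mu) / sd if sd else 0.0 for x in series]


def rank_pairs(series_by_name, length=32):
    """Return (name_a, name_b, score) tuples, strongest pair first."""
    prepared = {k: znorm(resample(v, length)) for k, v in series_by_name.items()}
    pairs = []
    for a, b in combinations(prepared, 2):
        # Mean product of z-normalized series equals the Pearson correlation.
        corr = sum(x * y for x, y in zip(prepared[a], prepared[b])) / length
        pairs.append((a, b, abs(corr)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical series with different lengths and units.
data = {
    "disk_io": list(range(10)),                    # 10 samples, small values
    "net_bytes": [100 * t for t in range(25)],     # 25 samples, large values
    "cpu_idle": [1 if t % 2 == 0 else -1 for t in range(16)],
}
ranked = rank_pairs(data)
for a, b, score in ranked:
    print(f"{a} ~ {b}: {score:.2f}")
```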

Log Analytics

Large-scale system analysis requires a multi-perspective view. That is why we are bringing log analytics into our platform, with the goal of reducing the manual burden of log file parsing. We provide an augmented view of log files, clustering log lines into red, green, and yellow groups based on their similarity scores.
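The document does not say what each log line is compared against, so the sketch below makes an explicit assumption: every line is scored by its best token-level Jaccard similarity to a small set of known-normal lines, and the score is bucketed into green, yellow, or red. The baseline lines, cutoffs, and function names are all hypothetical.

```python
import re


def tokens(line):
    """Lowercase word tokens with digits masked so IDs do not matter."""
    return set(re.sub(r"\d+", "<num>", line.lower()).split())


def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def color_lines(lines, baseline, green_at=0.6, yellow_at=0.3):
    """Tag each line by its best similarity to any baseline line."""
    base = [tokens(b) for b in baseline]
    colored = []
    for line in lines:
        score = max((jaccard(tokens(line), b) for b in base), default=0.0)
        if score >= green_at:
            color = "green"
        elif score >= yellow_at:
            color = "yellow"
        else:
            color = "red"
        colored.append((color, round(score, 2), line))
    return colored


# Hypothetical known-normal lines and incoming log lines.
baseline = ["volume 12 attached to node 3", "heartbeat ok from node 7"]
incoming = [
    "volume 48 attached to node 9",
    "volume 48 detached from node 9",
    "kernel panic: unable to mount root",
]
result = color_lines(incoming, baseline)
for color, score, line in result:
    print(color, score, line)
```

Masking digits before tokenizing is one way to keep volume and node IDs from dominating the similarity score.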