Advancing anomaly detection with AIOps: introducing AiDice | Azure Blog and Updates


This blog post has been co-authored by Jeffrey He, Product Manager, AIOps Platform and Experiences Team.

In Microsoft Azure, we invest heavily in ensuring our services are reliable by predicting and mitigating failures as quickly as we can. In large-scale cloud systems, however, we may still experience unexpected issues simply because of the massive scale of the system. Given this, using AIOps to continuously monitor health metrics is fundamental to running a cloud system successfully, as we have shared in our earlier posts. First, we shared more about this in Advancing Azure service quality with artificial intelligence: AIOps. We also shared an example deep dive of how we use AIOps to help Azure in the safe deployment space in Advancing safe deployment with AIOps. Today, we share another example, this time about how AI is used in the field of anomaly detection. Specifically, we introduce AiDice, a novel anomaly detection algorithm developed jointly by Microsoft Research and Microsoft Azure that identifies anomalies in large-scale, multi-dimensional time series data. AiDice not only captures incidents quickly, it also gives engineers important context that helps them diagnose issues more effectively, providing the best experience possible for end customers.

Why is AIOps needed for anomaly detection?

We need AIOps for anomaly detection because the data volume is simply too large to analyze without AI. In large-scale cloud environments, we monitor a vast number of cloud components, and each component logs a large number of rows of data. In addition, each row of data for any given cloud component might contain dozens of columns, such as the timestamp, the hardware type of the virtual machine, the generation number, the OS version, the datacenter where the nodes hosting the virtual machine reside, or the country. The structure of the data we have is essentially multi-dimensional time series data, which contains an exponential number of individual time series because of the many combinations of dimensions. This means that iterating through and monitoring every single time series is simply not practical, so applying AIOps is necessary.
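To make the combinatorial growth concrete, here is a minimal sketch that counts the possible pivots for a handful of dimensions. The dimension names and the number of distinct values per dimension are hypothetical and chosen only for illustration; they are not Azure figures.

```python
from math import prod

# Hypothetical dimensions and distinct-value counts (assumptions for illustration).
cardinalities = {
    "hardware_type": 20,
    "generation": 8,
    "os_version": 15,
    "datacenter": 60,
    "country": 30,
}


def count_pivots(cardinalities):
    """Count the non-empty pivots over the given dimensions.

    Each dimension is either left out of the pivot or fixed to one of its
    values, which gives (n + 1) choices per dimension. Subtracting 1 drops
    the empty pivot, where every dimension is left out.
    """
    return prod(n + 1 for n in cardinalities.values()) - 1


print(count_pivots(cardinalities))
```

Even with only five dimensions, these assumed value counts yield more than five million pivots, each with its own time series. With dozens of dimensions, enumerating all of them quickly becomes impractical.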

How did we approach this before AiDice?

Before AiDice, we handled anomaly detection in large-scale, high-dimensional time series data by running it on a selected set of the most important dimensions. By focusing on a scoped subset, we were able to detect anomalies within those combinations quickly. Once these anomalies were detected, engineers would dive deeper, using pivot tables to drill down into the dimensions that were not included in order to better diagnose the issue. Although this approach worked, we saw two key opportunities to improve the process. First, the old approach required a lot of manual effort from engineers to determine the exact pivot of an anomaly. Second, the approach limited the scope of direct monitoring, because we could only feed a limited number of dimensions into our anomaly detection algorithms. For these reasons, Microsoft Research and Azure worked together to develop AiDice, which improves both of these areas.

How do we approach this now with AiDice, and how does it work?

Now with AiDice, we can automatically localize pivots on time series data even when looking at dozens of dimensions at the same time. This allows us to add many more attributes, whether that is the hardware generation or hardware microcode, the OS version, or the networking agent version. Although this makes the search space much larger, AiDice encodes the problem as a combinatorial optimization problem, allowing it to search through the space more efficiently than traditional approaches. Brief details of AiDice are described below; for a full explanation of the algorithm, please see the paper published at ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020).

Part 1: AiDice algorithm (formulation as a search problem)

The AiDice algorithm works by first turning the data into a search problem. Search nodes are formed by starting at a given pivot and building the relationships out to its neighbors. For example, if we take a node, “Country=USA, Datacenter=DC1, DiskType=SSD”, we can form the neighboring nodes by swapping, adding, or removing a dimension-value pair, as shown in the diagram below.

This image shows how the search space is formed. On the left is a node graph. On the right is a zoomed-in view of a specific set of nodes, with arrows labeled with the relationships between the nodes.
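The neighbor relationship described above can be sketched as follows. This is only an illustration of the swap, add, and remove moves, not the implementation from the paper; the `neighbors` function and the `DIMENSION_VALUES` table are hypothetical names and values introduced for this example.

```python
# Hypothetical dimension values (assumptions for illustration only).
DIMENSION_VALUES = {
    "Country": ["USA", "Canada"],
    "Datacenter": ["DC1", "DC2"],
    "DiskType": ["SSD", "HDD"],
    "OSVersion": ["v1", "v2"],
}


def neighbors(pivot, dimension_values):
    """Return the pivots that differ from `pivot` by one swap, add, or remove."""
    result = []
    for dim, values in dimension_values.items():
        if dim in pivot:
            # Remove: drop this dimension-value pair (skip the empty pivot).
            removed = {d: v for d, v in pivot.items() if d != dim}
            if removed:
                result.append(removed)
            # Swap: keep the dimension but change its value.
            for value in values:
                if value != pivot[dim]:
                    result.append({**pivot, dim: value})
        else:
            # Add: constrain a dimension that the pivot does not use yet.
            for value in values:
                result.append({**pivot, dim: value})
    return result


node = {"Country": "USA", "Datacenter": "DC1", "DiskType": "SSD"}
for neighbor in neighbors(node, DIMENSION_VALUES):
    print(neighbor)
```

Representing a pivot as a dictionary of dimension-value pairs keeps each move a one-line change, and a search procedure only ever needs to generate the neighbors of the nodes it actually visits rather than the whole space.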

Part 2: AiDice algorithm (objective function)

Next, the AiDice algorithm searches through the search space in a smart way by maximizing an objective function that emphasizes two key components. First, the larger the sudden burst or change in errors, the higher AiDice scores the objective function. Second, the higher the proportion of the errors that occur in this pivot relative to the total number of errors, the higher AiDice scores the objective function. For example, if 5,000 total errors occurred, it is more important to alert the user about the pivot that went from 3,000 errors to 4,000 errors than the pivot that went from 10 to 20 errors.
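As a rough illustration of how these two components can interact, the toy score below multiplies the absolute increase in errors by the pivot's share of all errors. This formula is an assumption made for this sketch and is not the objective function from the AiDice paper; `pivot_score` is a hypothetical name.

```python
def pivot_score(errors_before, errors_after, total_errors):
    """Toy score: absolute increase in errors times the pivot's share of all errors.

    This is an illustrative stand-in, not the published AiDice objective.
    """
    if total_errors <= 0:
        return 0.0
    burst = max(errors_after - errors_before, 0)  # size of the sudden change
    share = errors_after / total_errors           # proportion of all errors
    return burst * share


# The two pivots from the example above, with 5,000 total errors.
large = pivot_score(3000, 4000, 5000)
small = pivot_score(10, 20, 5000)
print(large > small)
```

Under this toy score, the pivot that grew from 3,000 to 4,000 errors ranks far above the one that merely doubled from 10 to 20, which matches the prioritization described in the text.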

Part 3: Customization of alerts to reduce noise

Next, the alerts that AiDice produces need to be filtered and customized to be less noisy and more actionable, because the results so far are optimized from a mathematical perspective but do not yet incorporate domain knowledge about the meaning of the input data. This step can vary widely depending on the nature of the input data. As one example, consecutive alerts that share the same error code may be grouped together to reduce the total number of alerts.
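The grouping example mentioned above could look like the following sketch. The alert fields (`time`, `error_code`, `pivot`) and the `group_consecutive` helper are hypothetical, and the sketch assumes the alerts are already sorted by time.

```python
from itertools import groupby

# Hypothetical alerts, already sorted by time (assumption for illustration).
alerts = [
    {"time": 1, "error_code": "E100", "pivot": "Datacenter=DC1"},
    {"time": 2, "error_code": "E100", "pivot": "Datacenter=DC1"},
    {"time": 3, "error_code": "E100", "pivot": "Datacenter=DC1"},
    {"time": 4, "error_code": "E200", "pivot": "Datacenter=DC2"},
    {"time": 5, "error_code": "E100", "pivot": "Datacenter=DC1"},
]


def group_consecutive(alerts):
    """Merge runs of consecutive alerts that share the same error code."""
    grouped = []
    # groupby only merges adjacent items, so a later run with the same
    # error code becomes a separate group.
    for code, run in groupby(alerts, key=lambda alert: alert["error_code"]):
        run = list(run)
        grouped.append({
            "error_code": code,
            "first_time": run[0]["time"],
            "last_time": run[-1]["time"],
            "count": len(run),
        })
    return grouped


for group in group_consecutive(alerts):
    print(group)
```

Here five raw alerts collapse into three grouped alerts. Whether a later run with the same code should be merged into the earlier one is a domain decision that depends on the input data.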

AiDice in action: an example

The following is a real example in which AiDice helped detect a real issue early on. The details have been altered for confidentiality reasons.

  • We applied AiDice to monitor low memory error events in a certain type of virtual machine with more than a dozen dimensions of attribute information alongside the fault count, including the region, the datacenter location, the cluster, the build, the RAM, and the event type.
  • AiDice identified an increase in the number of low memory events on distinct nodes in a particular pivot, which indicated a memory leak.

    • Build=11.11111, Ram=00.0, ProviderName=Xxxxx-x-Xxxxxx, EventType=8888 (details have been altered for privacy).

  • When looking at the aggregate trend, this issue is hidden, and without AiDice it would take manual effort to detect the exact location of the issue (see graphs below, data normalized for privacy).
  • The engineer responsible for the ticket looked at the alert and some example cases shown in the alert, and was quickly able to figure out what was going on.

This image is a line chart of the aggregate trend in the low memory events for a certain type of VM over 17 timestamps. Overall, the trend remains relatively stable over time.

This image is a line chart of the anomaly identified by AiDice in a particular pivot over 17 timestamps. Overall, the trend clearly exhibits an anomaly, starting low then constantly increasing.

In this real-world example, AiDice was able to automatically, quickly, and efficiently detect an issue in a dimension combination that was causing a particular error type. Soon after, the memory leak was discovered and Azure engineers were able to mitigate the issue.
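To see why an aggregate trend can hide an anomaly that is obvious within a single pivot, consider the small sketch below. All counts are made up for illustration and are not the data behind the charts above; `relative_change` is a hypothetical helper.

```python
# Made-up event counts over 17 timestamps (not real Azure data).
pivot_counts = [2, 2, 2, 2, 2, 2, 2, 2, 3, 5, 8, 12, 17, 23, 30, 38, 47]
other_counts = [1000, 1012, 995, 1003, 990, 1008, 1001, 997, 1005,
                992, 1010, 998, 1004, 993, 1007, 999, 1002]

# The aggregate series is the anomalous pivot plus everything else.
aggregate = [p + o for p, o in zip(pivot_counts, other_counts)]


def relative_change(series, window=4):
    """Ratio of the mean of the last `window` points to the mean of the first."""
    start = sum(series[:window]) / window
    end = sum(series[-window:]) / window
    return end / start


print(f"aggregate: x{relative_change(aggregate):.2f}")
print(f"pivot:     x{relative_change(pivot_counts):.2f}")
```

With these numbers, the aggregate grows by only a few percent, which is hard to tell apart from normal fluctuation, while the pivot grows more than tenfold. Localizing the pivot is what makes the anomaly visible.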

Looking forward

Looking ahead, we hope to improve AiDice to make Azure even more resilient and reliable. Specifically, we plan to:

  • Support more scenarios in Azure: AiDice is already being applied to many scenarios in Azure, but the algorithm has room to improve with respect to the types of metrics it can operate on. Microsoft Azure and the Microsoft Research team are working together to support more metric scenarios.
  • Prepare more data feeds in Azure for AiDice: In addition to upgrading the AiDice algorithm itself to support more scenarios, we are also working to add supporting attributes to certain data sources to fully leverage the power of AiDice.
