SE | LG | Jan 10, 2023

Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

CMU | IBM | MIT
arXiv:2301.03797v2 | 95 citations | h-index: 75
Originality: Synthesis-oriented
AI Analysis

The paper addresses the manual effort and domain knowledge that cloud incident management demands of on-call engineers. It is an incremental application of existing LLMs to a new domain.

The paper tackles the problem of automating root-cause analysis and mitigation for cloud incidents by evaluating large language models (LLMs) like GPT-3.x on over 40,000 incidents at Microsoft, showing efficacy through human evaluation with incident owners.

Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents, and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.
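The abstract mentions lexical metrics without naming them here. As a rough illustration of what such a metric measures, the sketch below scores a generated root-cause sentence against a reference using a simplified ROUGE-L F1 (longest common subsequence over whitespace tokens). The choice of ROUGE-L, the function names, and the sample incident strings are all assumptions made for illustration; they are not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercased whitespace tokens.

    Hypothetical helper: real evaluations typically use a maintained
    metrics library and a proper tokenizer.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical texts, not drawn from the study's data.
    generated = "disk full on node"
    reference = "node disk full"
    print(round(rouge_l_f1(generated, reference), 3))
```

A lexical score like this rewards only shared wording, so a correct root cause phrased differently can score low. That limitation is presumably why the study pairs lexical metrics with semantic ones and a human evaluation.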
