NI · AI · Feb 15, 2024

X-lifecycle Learning for Cloud Incident Management using LLMs

Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan

arXiv:2404.03662v1 · 12.636 citations · h-index: 28 · SIGSOFT FSE Companion

Originality: Incremental advance

AI Analysis

This work addresses the tedious and manual process of incident management for on-call engineers in cloud services, though it appears incremental as it builds on existing LLM methods by adding more contextual data.

The paper tackles the automation of cloud incident management by using large language models (LLMs) to generate recommendations. It shows that adding contextual data from multiple software development lifecycle stages improves performance on two tasks: root cause recommendation and monitor ontology identification. Results are demonstrated on a dataset of 353 incidents and 260 monitors from Microsoft.

Incident management for large cloud services is a complex and tedious process that requires a significant amount of manual effort from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle [SDLC] (e.g., code, configuration, monitor data, service properties, service dependencies, troubleshooting documents, etc.) to generate insights for detecting, root-causing, and mitigating incidents. Recent advancements in large language models [LLMs] (e.g., ChatGPT, GPT-4, Gemini) have created opportunities to automatically generate contextual recommendations for OCEs, helping them quickly identify and mitigate critical issues. However, existing research typically takes a siloed view, solving a single incident management task by leveraging data from a single stage of the SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves the performance of two critically important and practically challenging tasks: (1) automatically generating root cause recommendations for dependency-failure-related incidents, and (2) identifying the ontology of service monitors used for automatically detecting incidents. Using a dataset of 353 incidents and 260 monitors from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves performance over state-of-the-art methods.
