AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
The paper addresses the problem of improving operational efficiency and availability in cloud IT systems, for practitioners and researchers alike, but as a review paper its contribution is incremental.
This paper reviews the application of AI to IT operations (AIOps) on cloud platforms, categorizing key tasks like incident detection and root cause analysis, and discusses trends, challenges, and opportunities without presenting new experimental results.
Artificial Intelligence for IT Operations (AIOps) aims to combine the power of AI with the big data generated by IT operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There is a wide variety of problems to address, and multiple use cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges, and opportunities, focusing specifically on the underlying AI techniques. We discuss in depth the key types of data emitted by IT operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis, and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively underexplored topics, especially those that could significantly benefit from advances in the AI literature. Finally, we provide insights into the trends in this field and the key investment opportunities.