A Holistic Framework for Automated Configuration Recommendation for Cloud Service Monitoring
This work addresses the reliability and operational efficiency of cloud service monitoring at large-scale companies such as Microsoft, and represents an incremental improvement over existing manual processes.
The paper tackles manual and reactive monitor configuration in cloud services, which leads to coverage gaps and redundant alerts. It designs a modular recommendation framework that processes graph-structured service entities to suggest optimal monitor configurations, and demonstrates the framework's efficacy through experiments on historical data and a user study at Microsoft.
The reliability of large-scale cloud services is critical for user satisfaction and business continuity. Despite significant investments in reliability engineering, production incidents remain inevitable and often lead to customer impact and operational overhead. In large cloud companies, multiple services are deployed across regions, necessitating robust health monitoring systems. However, the current monitor configuration process is manual, largely reactive, and ad hoc, resulting in gaps in coverage and redundant alerts. In this paper, we present a comprehensive study of monitor creation at Microsoft, identifying key components of the existing process. We further design a modular recommendation framework that processes graph-structured service entities to suggest optimal monitor configurations. Through extensive experimentation on historical data and a user study of recommendations for production services at Microsoft, we demonstrate the efficacy of our approach in providing relevant recommendations for monitor configurations.