FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity
FC-ADL addresses the operational complexity of managing microservices and offers software engineers a scalable way to improve system reliability, although the contribution is incremental: it adapts an existing neuroscientific concept to a specific domain.
The paper tackles anomaly detection and localisation in microservice architectures by proposing FC-ADL, an approach based on functional connectivity that efficiently characterises time-varying dependencies between microservice metrics. It reports top detection and localisation performance across a range of fault scenarios and demonstrates scalability on Alibaba's large-scale deployment.
Microservices have transformed software architecture through the creation of modular and independent services. However, they introduce operational complexities in service integration and system management that make swift and accurate anomaly detection and localisation challenging. Despite the complex, dynamic, and interconnected nature of microservice architectures, prior works that investigate metrics for anomaly detection rarely include explicit information about time-varying interdependencies. Whilst prior works on fault localisation typically do incorporate information about dependencies between microservices, they scale poorly to large-scale real-world deployments because they rely on computationally expensive causal inference. To address these challenges, we propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics, based on the neuroscientific concept of functional connectivity. We show that by efficiently characterising time-varying changes in dependencies between microservice metrics, we can both detect anomalies and provide root cause candidates without incurring the significant overheads of causal and multivariate approaches. We demonstrate that our approach achieves top detection and localisation performance across a wide range of fault scenarios when compared to state-of-the-art approaches. Furthermore, we illustrate the scalability of our approach by applying it to Alibaba's extremely large real-world microservice deployment.
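To make the idea of time-varying dependencies concrete, the following is a minimal sketch of one common way to estimate functional connectivity: Pearson correlation over sliding windows. It is an illustration only and not the FC-ADL implementation, whose details are not given here. The function names `sliding_connectivity` and `connectivity_change`, the window size, the synthetic metrics, and the scoring rule (summed absolute change in correlation) are all assumptions made for this example.

```python
import numpy as np


def sliding_connectivity(metrics, window, step):
    """Return one Pearson correlation matrix per sliding window.

    metrics: array of shape (timesteps, n_metrics).
    Hypothetical helper; sliding-window correlation is assumed here
    as a simple stand-in for a functional connectivity estimate.
    """
    starts = range(0, metrics.shape[0] - window + 1, step)
    return np.stack(
        [np.corrcoef(metrics[s:s + window], rowvar=False) for s in starts]
    )


def connectivity_change(fc):
    """Per-metric change in connectivity between consecutive windows.

    Row k holds, for each metric, the summed absolute change of its
    correlations between window k and window k + 1.
    """
    return np.abs(np.diff(fc, axis=0)).sum(axis=2)


# Synthetic data (assumed): four metrics driven by a shared latent load.
rng = np.random.default_rng(0)
timesteps, n_metrics, fault_start, faulty = 600, 4, 300, 2
latent = rng.standard_normal(timesteps)
metrics = latent[:, None] + 0.1 * rng.standard_normal((timesteps, n_metrics))
# Injected fault: metric 2 decouples from the shared load after t = 300.
metrics[fault_start:, faulty] = rng.standard_normal(timesteps - fault_start)

fc = sliding_connectivity(metrics, window=100, step=100)
node_change = connectivity_change(fc)

# Detection: the transition with the largest total connectivity change.
window_scores = node_change.sum(axis=1)
most_anomalous = int(np.argmax(window_scores))

# Localisation: rank metrics by how much their connectivity changed.
candidates = np.argsort(node_change[most_anomalous])[::-1]
print("most anomalous transition:", most_anomalous)
print("root cause candidates (best first):", candidates.tolist())
```

In this sketch, detection and localisation both fall out of the same quantity: the transition with the largest overall change flags the anomaly, and the metric whose correlations changed most is the leading root cause candidate. Each window costs only one correlation matrix, which hints at why such an approach can avoid the overhead of causal inference, though the actual cost profile of FC-ADL may differ.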