SE · Apr 29

Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?

arXiv:2604.26670 · 61.8
Predicted impact: top 35% in SE · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners debugging microservice systems, this work provides a more effective root cause localization method by leveraging both data-level and entity-level heterogeneity.

The paper identifies that existing microservice root cause localization methods fail to capture the full diagnostic value of system heterogeneity. The proposed NexusRCL framework, which models services and hosts as distinct node types in a heterogeneous graph, achieves up to 49.85% improvement in Top-1 accuracy and 32.70% in Average Top-5 accuracy over state-of-the-art methods.
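To make the idea of "distinct node types" concrete, the following is a minimal sketch of a heterogeneous graph in plain Python. It is not NexusRCL's implementation: the class `HeteroGraph`, the relation names (`calls`, `runs`), and the service and host names are all hypothetical, chosen only to illustrate typed nodes and directed, cross-layer edges.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Toy heterogeneous graph: nodes and edges are both grouped by type."""
    # node_type -> set of node names, e.g. "service" -> {"cart", "frontend"}
    nodes: dict = field(default_factory=lambda: defaultdict(set))
    # (src_type, relation, dst_type) -> list of directed (src, dst) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_edge(self, src_type, src, relation, dst_type, dst):
        """Register both endpoints under their types and add a directed edge."""
        self.nodes[src_type].add(src)
        self.nodes[dst_type].add(dst)
        self.edges[(src_type, relation, dst_type)].append((src, dst))

    def successors(self, name, edge_type):
        """Return targets reachable from `name` over one typed relation."""
        return [dst for src, dst in self.edges.get(edge_type, []) if src == name]


# Hypothetical topology: two services on one host, one call dependency.
g = HeteroGraph()
g.add_edge("service", "frontend", "calls", "service", "cart")
g.add_edge("host", "node-1", "runs", "service", "frontend")
g.add_edge("host", "node-1", "runs", "service", "cart")

# Cross-layer, directed view: services that a fault on node-1 could reach.
affected = g.successors("node-1", ("host", "runs", "service"))
```

Keeping the edge type as a `(source type, relation, target type)` triple means host-to-service and service-to-service dependencies are stored separately, so a model can treat them differently instead of collapsing them into one homogeneous adjacency list.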

Microservice root cause localization (RCL) is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to effectively capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.
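The reported metrics can be illustrated with a short sketch. It assumes a convention that is common in RCL evaluations, where Top-k accuracy is the fraction of fault cases whose true root cause appears among the top k ranked candidates, and the average Top-5 figure is the mean of Top-1 through Top-5 accuracy. The paper's exact definitions may differ; the function names and the toy rankings below are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(rankings: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose true root cause is within the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def average_top_k(rankings: Sequence[Sequence[str]],
                  truths: Sequence[str], k: int = 5) -> float:
    """Mean of Top-1 .. Top-k accuracy (assumed definition of the average metric)."""
    return sum(top_k_accuracy(rankings, truths, i) for i in range(1, k + 1)) / k


# Three hypothetical fault cases: ranked candidates and the labeled root cause.
rankings = [
    ["cart", "db", "node-1"],
    ["node-2", "payment", "cart"],
    ["db", "cart", "payment"],
]
truths = ["cart", "payment", "payment"]

a1 = top_k_accuracy(rankings, truths, 1)   # only the first case is a Top-1 hit
avg5 = average_top_k(rankings, truths, 5)  # averages Top-1 through Top-5
```

On this toy data the Top-1 accuracy is 1/3 and the average over k = 1..5 is 0.8, since the per-k accuracies are 1/3, 2/3, 1, 1, and 1.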
