ANSC: Probabilistic Capacity Health Scoring for Datacenter-Scale Reliability
This work addresses reliability for hyperscale datacenter operators by ranking alerts according to the risk of capacity shortfall, though the contribution appears incremental because it builds on existing alerting systems.
The paper tackles cascading capacity shortfalls in hyperscale datacenter fabrics by introducing ANSC, a probabilistic capacity health scoring framework that prioritizes issues by the probability of imminent capacity violations. This lets operators prioritize remediation across more than 400 datacenters and 60 regions with less alert noise.
We present ANSC, a probabilistic capacity health scoring framework for hyperscale datacenter fabrics. While existing alerting systems detect individual device or link failures, they do not capture the aggregate risk of cascading capacity shortfalls. ANSC provides a color-coded scoring system that indicates the urgency of issues \emph{not solely by current impact, but by the probability of imminent capacity violations}. Our system accounts for both current residual capacity and the probability of additional failures, normalized at the datacenter and regional levels. We demonstrate that ANSC enables operators to prioritize remediation across more than 400 datacenters and 60 regions, reducing noise and focusing SRE attention on the most critical risks.
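
The abstract does not give ANSC's scoring formula, so the following is only a minimal sketch of the general idea: combine current residual capacity with the probability of additional failures, then map the result to a color. Everything in it is an assumption made for illustration, not the paper's method: independent link failures, exhaustive enumeration of failure states (practical only for small link groups), the names `violation_probability` and `health_color`, and the amber and red thresholds. It does not attempt the datacenter or regional normalization.

```python
from itertools import product


def violation_probability(capacities, fail_probs, demand):
    """Probability that surviving capacity drops below demand.

    Assumes links fail independently (an illustrative simplification).
    A link that is already down can be modeled with fail_prob = 1.0.
    """
    p_violation = 0.0
    # Enumerate every up/down combination of the links.
    for states in product((True, False), repeat=len(capacities)):
        p_state = 1.0
        residual = 0.0
        for up, cap, pf in zip(states, capacities, fail_probs):
            p_state *= (1.0 - pf) if up else pf
            if up:
                residual += cap
        if residual < demand:
            p_violation += p_state
    return p_violation


def health_color(p_violation, amber=0.01, red=0.10):
    """Map a violation probability to a color (hypothetical thresholds)."""
    if p_violation >= red:
        return "red"
    if p_violation >= amber:
        return "amber"
    return "green"


# Four 100-unit links, each with a 10% failure probability, demand of 250:
# at least three links must survive.
healthy = violation_probability([100] * 4, [0.1] * 4, 250)
print(round(healthy, 4), health_color(healthy))    # 0.0523 amber

# Same group with one link already down: no current violation (300 >= 250),
# but any further failure causes one, so the score escalates.
degraded = violation_probability([100] * 4, [0.1, 0.1, 0.1, 1.0], 250)
print(round(degraded, 4), health_color(degraded))  # 0.271 red
```

In the second case the group still meets demand, yet it scores red because a single additional failure would cause a shortfall. This is the distinction the abstract draws between current impact and the probability of an imminent violation.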