DC · NI · Mar 7

Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale

arXiv:2603.07345v1 · 8.5 · h-index: 30

Predicted impact top 46% in DC · last 90 days · Originality: Highly original

AI Analysis

This work addresses the problem of balancing reliability and cost-efficiency in hyperscale microservice infrastructure for large-scale platforms like Uber, offering a significant improvement over traditional 2x capacity models.

Uber developed a Failover Architecture (UFA) to improve the efficiency of its hyperscale microservice infrastructure while maintaining reliability. By differentiating services based on criticality and optimizing capacity allocation, UFA reduced steady-state provisioning from 2x to 1.3x, increasing utilization from ~20% to ~30% and eliminating over one million CPU cores while sustaining 99.97% availability.
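The reported figures can be cross-checked with simple arithmetic. The sketch below assumes demand stays constant and that the whole fleet moves from the 2x factor to the 1.3x factor; both are simplifying assumptions, and the function names are illustrative rather than taken from the paper. The four-million-core baseline is the figure given in the abstract.

```python
def utilization_after(baseline_utilization: float, old_factor: float, new_factor: float) -> float:
    """Utilization after shrinking provisioned capacity while demand stays fixed.

    Utilization is demand / capacity, so it scales inversely with the
    provisioning factor.
    """
    return baseline_utilization * old_factor / new_factor


def cores_saved(baseline_cores: float, old_factor: float, new_factor: float) -> float:
    """Cores freed if the entire baseline fleet moves from old_factor to new_factor."""
    return baseline_cores * (1 - new_factor / old_factor)


# Figures quoted in the summary and abstract.
new_util = utilization_after(0.20, old_factor=2.0, new_factor=1.3)
saved = cores_saved(4_000_000, old_factor=2.0, new_factor=1.3)

print(f"utilization: {new_util:.1%}")   # utilization: 30.8%
print(f"cores saved: {saved:,.0f}")     # cores saved: 1,400,000
```

Under these assumptions the utilization estimate matches the reported ~30%, and the 1.4 million cores is an idealized upper bound that is consistent with the reported "over one million".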

Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model--each service provisioned to handle global traffic independently across two regions--leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs) using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
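The selective preemption described in the abstract can be pictured with a small sketch. Everything here is hypothetical: the `Service` fields, the service names, and the policy of preempting the services with the loosest restore SLA first are illustrative assumptions, not the mechanism the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    critical: bool
    cores: int
    restore_sla_minutes: int = 0  # hypothetical field: how long the service may stay down


def plan_preemption(services: list[Service], cores_needed: int) -> list[str]:
    """Pick non-critical services to preempt until enough cores are freed.

    Assumed policy: preempt the services with the loosest restore SLA first,
    and never touch critical services.
    """
    candidates = sorted(
        (s for s in services if not s.critical),
        key=lambda s: s.restore_sla_minutes,
        reverse=True,
    )
    plan, freed = [], 0
    for svc in candidates:
        if freed >= cores_needed:
            break
        plan.append(svc.name)
        freed += svc.cores
    if freed < cores_needed:
        raise RuntimeError("not enough preemptible capacity for this failover")
    return plan


# Hypothetical fleet; names and sizes are made up for illustration.
fleet = [
    Service("payments", critical=True, cores=400),
    Service("batch-reports", critical=False, cores=300, restore_sla_minutes=120),
    Service("ml-training", critical=False, cores=500, restore_sla_minutes=240),
    Service("internal-dashboards", critical=False, cores=100, restore_sla_minutes=30),
]

plan = plan_preemption(fleet, cores_needed=600)
print(plan)  # ['ml-training', 'batch-reports']
```

A real system would also need the dependency analysis the abstract mentions, so that no critical service relies on a service in the preemption plan; this sketch leaves that out.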
