DC · May 6

A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

Roozbeh Bostandoost, Pooria Namyar, Siva Kesava Reddy Kakarla, Ryan Beckett, Santiago Segarra, Eli Cortez, Ankur Mallick, Kevin Hsieh, Rodrigo Fonseca, Mohammad Hajiesmaili, Behnaz Arzani

arXiv:2512.07750 · 22.7 · h-index: 18

Predicted impact: top 63% in DC · last 90 days
Originality: Incremental advance

AI Analysis

For cloud operators deploying multiple ML models in VM allocation, SANJESH provides a systematic way to detect failures from correlated distributional shifts across models.

SANJESH formulates a bi-level optimization to stress-test how multiple ML models in a cloud VM allocation pipeline can adversarially interact, uncovering scenarios that cause 4× worse performance than the operator's evaluator detected.

Cloud operators increasingly deploy multiple ML models in their VM allocation pipelines. In such settings, individually benign predictions can shift and compound, severely degrading performance. In a cloud provider's VM placement pipeline, CPU, memory, and lifetime prediction models jointly determine server count, live migration frequency, and network utilization; yet no existing approach can systematically stress-test how these models adversely interact. Deterministic adversarial analyzers cannot capture probabilistic ML behavior, so operators miss failures that arise only from correlated distributional shifts across models. In SANJESH, we formulate a bi-level optimization that captures how the ML models behave statistically and uncovers how they adversely interact. The outer level searches over what predictions the ML models could produce under distributional uncertainty to find adversarial conditions; the inner level evaluates how the VM allocator behaves given those predictions. When we applied it to the operator's production traces, SANJESH uncovered scenarios that cause $4\times$ worse performance than the operator's evaluator detected.
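The outer/inner split described in the abstract can be illustrated with a toy sketch. This is not SANJESH's formulation. It swaps in plain random search for the outer level and a first-fit bin packer for the inner level. It also assumes a single size prediction per VM with a bounded error (`max_error`) as a crude stand-in for distributional uncertainty. All function and parameter names are hypothetical.

```python
import random


def first_fit(sizes, capacity=1.0):
    """Inner level (toy allocator): pack VMs by first-fit, return servers used."""
    free = []  # remaining capacity on each open server
    for size in sizes:
        for i, room in enumerate(free):
            if size <= room + 1e-12:
                free[i] = room - size
                break
        else:
            # No open server fits this VM (or none exist yet): open a new one.
            free.append(capacity - size)
    return len(free)


def adversarial_search(true_sizes, max_error=0.1, trials=200, seed=0):
    """Outer level (toy): random search over bounded prediction errors.

    Assumption: each predicted size may deviate from the true size by at
    most max_error. The search keeps the prediction vector that makes the
    allocator open the most servers.
    """
    rng = random.Random(seed)
    baseline = first_fit(true_sizes)
    worst_count, worst_pred = baseline, list(true_sizes)
    for _ in range(trials):
        pred = [
            min(1.0, max(0.0, s + rng.uniform(-max_error, max_error)))
            for s in true_sizes
        ]
        count = first_fit(pred)  # inner-level evaluation
        if count > worst_count:
            worst_count, worst_pred = count, pred
    return baseline, worst_count, worst_pred
```

Calling `adversarial_search([0.5] * 8)` returns the server count under exact predictions, the worst count found under perturbed predictions, and the perturbed predictions themselves. A real analyzer would replace the random search with an optimization over the models' joint prediction distributions and the packer with the production allocator's logic.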
