DC · May 6

A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

arXiv:2512.07750 · 22.7 · h-index: 18
Predicted impact: top 63% in DC · last 90 days
Originality: Incremental advance
AI Analysis

For cloud operators deploying multiple ML models in VM allocation, SANJESH provides a systematic way to detect failures from correlated distributional shifts across models.

SANJESH formulates a bi-level optimization to stress-test how multiple ML models in a cloud VM allocation pipeline can adversarially interact, uncovering scenarios that cause 4× worse performance than the operator's evaluator detected.

Cloud operators increasingly deploy multiple ML models in their VM allocation pipelines. In such settings, individually benign predictions can shift and compound, severely degrading performance. In a cloud provider's VM placement pipeline, CPU, memory, and lifetime prediction models jointly determine server count, live migration frequency, and network utilization; yet no existing approach can systematically stress-test how these models adversely interact. Deterministic adversarial analyzers cannot capture probabilistic ML behavior, so operators miss failures that arise only from correlated distributional shifts across models. In SANJESH, we formulate a bi-level optimization that captures how the ML models behave statistically and uncovers how they adversely interact. The outer level searches over what predictions the ML models could produce under distributional uncertainty to find adversarial conditions; the inner level evaluates how the VM allocator behaves given those predictions. When we applied it to the operator's production traces, SANJESH uncovered scenarios that cause 4× worse performance than the operator's evaluator detected.
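
The outer/inner split described above can be illustrated with a toy sketch. This is not SANJESH's formulation: the first-fit allocator, the brute-force outer search, the box-shaped set of per-VM prediction shifts, and all names (`first_fit_servers`, `worst_case`, `deltas`) are assumptions chosen only to make the structure concrete. The abstract describes distributional uncertainty and a real optimization, which this sketch replaces with exhaustive enumeration over a handful of shifts.

```python
import itertools


def first_fit_servers(predicted, capacity):
    """Inner level (assumed allocator): place VMs by predicted demand
    using first-fit and return how many servers end up in use."""
    free = []  # remaining capacity of each open server
    for demand in predicted:
        for i, room in enumerate(free):
            if demand <= room:
                free[i] -= demand
                break
        else:
            # No open server has room: open a new one.
            free.append(capacity - demand)
    return len(free)


def worst_case(nominal, deltas, capacity):
    """Outer level (assumed brute force): try every combination of
    per-VM prediction shifts and keep the one that costs the most."""
    worst_cost = first_fit_servers(nominal, capacity)
    worst_pred = list(nominal)
    for shift in itertools.product(deltas, repeat=len(nominal)):
        predicted = [n + s for n, s in zip(nominal, shift)]
        cost = first_fit_servers(predicted, capacity)
        if cost > worst_cost:
            worst_cost, worst_pred = cost, predicted
    return worst_cost, worst_pred


# Hypothetical numbers: four VMs predicted at 4 units each, servers of 8 units,
# and each prediction allowed to move by at most one unit.
nominal = [4, 4, 4, 4]
baseline = first_fit_servers(nominal, capacity=8)
cost, predicted = worst_case(nominal, deltas=(-1, 0, 1), capacity=8)
print(baseline, cost, predicted)
```

In this toy instance the nominal predictions pack into 2 servers, while small shifts that are each within the allowed range push the allocator to 4, which mirrors the point that individually benign prediction changes can compound. Exhaustive search grows exponentially with the number of VMs, so it only serves to show the two levels, not how one would search at production scale.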

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.