52.5 NI May 31
Understanding Cross-Cloud Interconnects: Hands-On Measurements and Cost Optimization
Eitan Eliav, Isaac Keslassy, David Breitgand et al.
New services such as Google Cross-Cloud Interconnect (CCI) address the rise in fast and large-scale cross-cloud data transfers. CCI offers dedicated high-throughput links with low per-GB transfer costs, but also involves high fixed leasing fees and multi-day provisioning delays. This combination makes cost optimization difficult because traffic patterns are unpredictable. This paper presents the first comprehensive study of CCI-like services. We begin with an empirical characterization of CCI and its alternatives using direct measurements across AWS-GCP interconnects. We then introduce ToggleCCI, a new dynamic cost-optimization algorithm designed to handle provisioning delays and uncertainty in future demand. ToggleCCI adapts by switching between VPN and CCI based on cost trends observed over a sliding time window. We prove that ToggleCCI achieves asymptotic optimality under sustained high-demand or low-demand regimes. Finally, using real-world traffic traces, we show that ToggleCCI consistently tracks the best static policy for each scenario and delivers substantial cost savings.
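The switching idea in this abstract (compare VPN and CCI costs over a sliding window, and account for provisioning delay) can be sketched as follows. This is a minimal illustration, not the paper's ToggleCCI algorithm: the cost model, the decision rule, the function name `toggle_policy`, and all parameter names and prices are assumptions made for the example.

```python
from collections import deque

def toggle_policy(demand_gb, window, delay, vpn_per_gb, cci_per_gb, cci_fixed):
    """Return the link ("vpn" or "cci") used in each period.

    Assumed cost model (not from the paper):
      VPN cost per period = vpn_per_gb * traffic
      CCI cost per period = cci_fixed + cci_per_gb * traffic
    A switch decided at period t takes effect at period t + 1 + delay.
    For simplicity the same delay is applied in both directions.
    """
    recent = deque(maxlen=window)   # sliding window of past demand
    pending = []                    # (activation_period, target_mode)
    mode, desired = "vpn", "vpn"
    used = []
    for t, gb in enumerate(demand_gb):
        # Apply any switch whose provisioning delay has elapsed.
        while pending and pending[0][0] <= t:
            mode = pending.pop(0)[1]
        used.append(mode)
        recent.append(gb)
        if len(recent) == window:
            # What each option would have cost over the window.
            vpn_cost = vpn_per_gb * sum(recent)
            cci_cost = cci_fixed * window + cci_per_gb * sum(recent)
            target = "cci" if cci_cost < vpn_cost else "vpn"
            if target != desired:
                desired = target
                pending.append((t + 1 + delay, target))
    return used

# Low demand for three periods, then sustained high demand.
trace = [10] * 3 + [1000] * 6
schedule = toggle_policy(trace, window=3, delay=2,
                         vpn_per_gb=0.1, cci_per_gb=0.02, cci_fixed=10)
print(schedule)
```

In this toy trace the policy stays on VPN while demand is low, requests CCI once the windowed cost comparison flips, and only starts using CCI after the assumed provisioning delay.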
NI Mar 6
CrossCheck: Input Validation for WAN Control Systems
Alexander Krentsel, Rishabh Iyer, Isaac Keslassy et al.
We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs, which often stem from bugs in the SDN control infrastructure, CrossCheck alerts operators before these inputs trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN. During that time, it accurately detected the single incident of invalid inputs that occurred, while sustaining a 0% false positive rate under normal operation and hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
54.1 NI Mar 19
Congestion Control for Spraying with Congested Paths
Barak Gerstein, Mark Silberstein, Isaac Keslassy
Packet spraying approaches are increasingly deployed in datacenter networks. However, their combination with existing congestion control algorithms (CCAs) may lead to poor QoS, especially when some of the paths are congested. In this paper, we first model the throughput collapse of a wide array of CCAs when some of the paths are congested. We explain that since CCAs are typically designed for single-path routing, their estimation function focuses on the latest feedback and mishandles feedback that reflects multiple paths. We propose using a median feedback that is more robust to the varying signals that come with multiple paths. We introduce MSwift and MNSCC, which apply this median principle to Google's Swift and Ultra Ethernet's NSCC. We demonstrate that they can improve both CCAs, reaching better QoS both under congested paths and in uncongested networks.
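The contrast between reacting to the latest feedback and reacting to a median of recent feedback can be illustrated with a small sketch. It is not MSwift or MNSCC: the class name `MedianFeedback`, the window size, and the delay values are hypothetical, and a real CCA would feed the estimate into its own window or rate update.

```python
from collections import deque
from statistics import median

class MedianFeedback:
    """Keep the last `window` delay samples and report their median."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def update(self, delay_us):
        # Record the new sample and return the median of the window.
        self.samples.append(delay_us)
        return median(self.samples)

# Delays (in microseconds) seen by a sprayed flow: most paths are
# uncongested, but one ACK comes back from a congested path.
fb = MedianFeedback(window=5)
delays = [10, 11, 10, 200, 12]
estimates = [fb.update(d) for d in delays]

latest_sample = delays[-2]        # 200: what a latest-feedback rule sees
median_estimate = estimates[-1]   # median over the last five samples
print(latest_sample, median_estimate)
```

A rule driven by the single 200 µs sample would back off sharply even though most paths are fine, while the median over the window stays close to the uncongested delay.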
NI Mar 7
Scheduling Parallel Optical Circuit Switches for AI Training
Kevin Liang, Litao Qiao, Isaac Keslassy et al.
The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading approach. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm Spectra relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
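To make the Schedule step more concrete, the sketch below assigns already-decomposed weighted permutations to $s$ switches with a greedy longest-first, least-loaded rule. This is a generic load-balancing heuristic chosen for illustration, not Spectra's actual assignment: the function name `assign_permutations` and the assumption that each permutation costs its weight plus one reconfiguration delay are mine, and the Decompose and Equalize steps are not shown.

```python
import heapq

def assign_permutations(weights, s, delta):
    """Greedily assign weighted permutations to s parallel switches.

    Assumption: serving a permutation of weight w on a switch takes
    w + delta time (one reconfiguration per permutation).
    Returns (assignment, makespan), where assignment[i] lists the
    permutation indices placed on switch i.
    """
    loads = [(0.0, i) for i in range(s)]   # min-heap of (load, switch)
    heapq.heapify(loads)
    assignment = [[] for _ in range(s)]
    # Heaviest permutations first, each onto the least-loaded switch.
    for idx in sorted(range(len(weights)), key=lambda k: -weights[k]):
        load, sw = heapq.heappop(loads)
        assignment[sw].append(idx)
        heapq.heappush(loads, (load + weights[idx] + delta, sw))
    makespan = max(load for load, _ in loads)
    return assignment, makespan

assignment, makespan = assign_permutations([5, 3, 3, 2, 1], s=2, delta=1)
print(assignment, makespan)
```

The residual gap between switch loads after such a greedy pass is the kind of imbalance that the abstract's Equalize step addresses by splitting permutations.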
LG Sep 26, 2019
RADE: Resource-Efficient Supervised Anomaly Detection Using Decision Tree-Based Ensemble Methods
Shay Vargaftik, Isaac Keslassy, Ariel Orda et al.
Decision-tree-based ensemble classification methods (DTEMs) are a prevalent tool for supervised anomaly detection. However, as datasets continue to grow, DTEMs suffer from growing drawbacks such as larger memory footprints, longer training times, and higher classification latencies with lower throughput. In this paper, we present, design, and evaluate RADE, a DTEM-based anomaly detection framework that augments standard DTEM classifiers and alleviates these drawbacks by relying on two observations: (1) a small (coarse-grained) DTEM model is sufficient to classify the majority of the classification queries correctly, where a classification is considered valid only if its confidence level is greater than or equal to a predetermined classification confidence threshold; (2) in the fewer, harder cases where the coarse-grained DTEM model has insufficient confidence in its classification, we can improve the result by forwarding the classification query to one of the expert (fine-grained) DTEM models, which is explicitly trained for that particular case. We implement RADE in Python based on scikit-learn and evaluate it over different DTEM methods (RF, XGBoost, AdaBoost, GBDT, and LightGBM) and over three publicly available datasets. Our evaluation on both a strong AWS EC2 instance and a Raspberry Pi 3 device indicates that RADE offers competitive and often superior anomaly detection capabilities compared to standard DTEM methods, while significantly improving memory footprint (by up to 5.46x), training time (by up to 17.2x), and classification latency (by up to 31.2x).
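The two observations describe a confidence-gated cascade, which can be sketched in a few lines. This is not RADE's implementation: the models here are toy stand-in functions rather than scikit-learn ensembles, the name `cascade_classify` is hypothetical, and choosing the expert by the coarse model's tentative label is an assumption made for the example.

```python
def cascade_classify(x, coarse, experts, threshold):
    """Classify x with the coarse model, falling back to an expert.

    coarse(x) and each experts[label](x) return (label, confidence).
    Assumption: the expert is selected by the coarse model's
    tentative label.
    """
    label, confidence = coarse(x)
    if confidence >= threshold:
        return label, "coarse"           # confident enough: stop here
    expert_label, _ = experts[label](x)  # low confidence: ask an expert
    return expert_label, "expert"

# Toy stand-ins for trained models (hypothetical decision rules).
def coarse_model(x):
    if x < 1:
        return "normal", 0.95
    if x > 9:
        return "anomaly", 0.90
    return "normal", 0.60                # uncertain middle region

def normal_expert(x):
    return ("anomaly" if x > 5 else "normal"), 1.0

def anomaly_expert(x):
    return "anomaly", 1.0

experts = {"normal": normal_expert, "anomaly": anomaly_expert}

easy = cascade_classify(0.5, coarse_model, experts, threshold=0.8)
hard = cascade_classify(7.0, coarse_model, experts, threshold=0.8)
print(easy, hard)
```

Only the uncertain query reaches an expert, which is how a cascade of this shape can keep the common-case path small and fast.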