Vamsi Addanki

NI
h-index: 9
3 papers
18 citations
Novelty: 45%
AI Score: 40

3 Papers

NI · May 26
Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute!

Eliezer Amponsah, Vamsi Addanki

The growing demand for efficient communication in distributed training and inference has sparked significant interest in reconfigurable photonic interconnects across both academia and industry. Mixture-of-Experts (MoE) models, with their highly skewed communication patterns, present a natural opportunity for such circuit-switched fabrics. However, existing approaches largely optimize communication in isolation, overlooking the interaction between communication and the expert computation that follows. In this paper, we revisit circuit scheduling for all-to-all communication in MoE execution. We show that the dispatch-compute-combine structure fundamentally challenges classical scheduling techniques such as Birkhoff-von Neumann (BvN) decomposition. First, MoE communication matrices are rarely doubly stochastic, introducing significant scheduling bubbles in BvN-based schedules. Second, while decomposition enables communication-compute overlap, the excessive number of matchings produced by BvN fragments execution into small batches, leading to severe compute inefficiencies due to fixed execution overheads. Motivated by these observations, we explore a simple greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching. Despite its simplicity, the approach significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all.
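The greedy max-weight idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the brute-force matching search (only practical for a handful of nodes), and the choice to drain each selected entry completely in one matching are all assumptions made for illustration.

```python
from itertools import permutations


def max_weight_matching(demand):
    """Brute-force max-weight perfect matching (assumption: tiny n only).

    Returns a list `perm` where source i is connected to destination perm[i].
    """
    n = len(demand)
    return list(max(permutations(range(n)),
                    key=lambda p: sum(demand[i][p[i]] for i in range(n))))


def greedy_decompose(demand, max_matchings):
    """Hypothetical greedy decomposition of a traffic matrix.

    Repeatedly picks a max-weight matching and serves the whole remaining
    demand on its edges, so each matching carries large batches and the
    number of matchings is bounded by `max_matchings`.
    """
    remaining = [row[:] for row in demand]
    schedule = []
    for _ in range(max_matchings):
        if not any(any(row) for row in remaining):
            break  # all demand served
        perm = max_weight_matching(remaining)
        circuits = {(src, dst): remaining[src][dst]
                    for src, dst in enumerate(perm)
                    if remaining[src][dst] > 0}
        for src, dst in circuits:
            remaining[src][dst] = 0
        schedule.append(circuits)
    return schedule, remaining


# A skewed 3-node demand matrix (made-up numbers, not from the paper).
demand = [[0, 8, 1],
          [2, 0, 7],
          [9, 1, 0]]
schedule, leftover = greedy_decompose(demand, max_matchings=2)
```

In this toy matrix the first matching carries the three heavy entries and the second carries the light ones, so two circuit configurations serve all demand. A BvN-style decomposition would instead assign each matching only the minimum weight on its edges, which tends to produce more and smaller matchings.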

NI · May 22
BShare: Packet Queueing Delay-Driven Buffer Sharing for Datacenter Switches

Krishna Agarwal, Muhamad Rizka Maulana, Vamsi Addanki et al.

Modern datacenter switches share packet buffers across ports to boost overall throughput and reduce packet loss. However, as the buffer available per port per unit of bandwidth continues to decrease, existing buffer-sharing strategies face increasing performance challenges. Recent efforts have attempted to integrate Buffer Management (BM) with Active Queue Management (AQM) to combine the advantages of both approaches. While these hybrid solutions show promise, the complexity of dynamically calculating multiple factors for the integration hinders generalization and efficiency. This paper presents BShare, a simple buffer-sharing mechanism driven by packet queueing delay. BShare requires only a single operator-configurable parameter. Our simulation results show that BShare improves the flow completion time (FCT) performance of advanced transport protocols, such as PowerTCP, by up to 45.07% compared to ABM, particularly under burst-heavy datacenter workloads.
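The abstract does not spell out BShare's admission rule, so the following is only a hypothetical illustration of what a delay-driven decision with a single parameter could look like. The function names, the delay estimate (queued bytes divided by the port's drain rate), and the threshold `delay_threshold_us` are assumptions, not the paper's algorithm.

```python
def queueing_delay_us(queue_bytes, drain_rate_gbps):
    """Time to drain the queued bytes at the port rate, in microseconds."""
    # bytes * 8 bits / (Gbps * 1e9 bits/s) seconds, scaled by 1e6 to microseconds
    return queue_bytes * 8 / (drain_rate_gbps * 1e3)


def admit(queue_bytes, packet_bytes, drain_rate_gbps, delay_threshold_us):
    """Hypothetical rule: admit a packet only if the queue's estimated
    delay, including the new packet, stays within one delay threshold."""
    delay = queueing_delay_us(queue_bytes + packet_bytes, drain_rate_gbps)
    return delay <= delay_threshold_us


# 125,000 bytes queued on a 100 Gbps port take 10 microseconds to drain.
delay_fast_port = queueing_delay_us(125_000, 100)
# The same backlog on a 25 Gbps port corresponds to four times the delay.
delay_slow_port = queueing_delay_us(125_000, 25)
```

One reason a delay-based rule might be attractive is visible here: the same byte backlog means very different waiting times on ports of different speeds, so a threshold expressed in time adapts across port rates without per-port tuning.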

NI · Jan 5, 2024
Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Vamsi Addanki, Maciej Pacut, Stefan Schmid

Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms. Unfortunately, switches are unable to benefit from these algorithms due to the lack of hardware support for push-out operations. Our key observation is that drop-tail buffers can emulate push-out buffers if the future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about future arrivals has the potential to significantly improve performance. This paper is the first research attempt in this direction. We propose Credence, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. Credence can unlock performance so far attainable only by push-out algorithms. Its performance hinges on the accuracy of predictions. Specifically, Credence achieves the near-optimal performance of the best-known push-out algorithm, LQD (Longest Queue Drop), given perfect predictions, but gracefully degrades to the performance of the simplest drop-tail algorithm, Complete Sharing, as the prediction error grows arbitrarily large. Our evaluations show that Credence improves throughput by 1.5x compared to traditional approaches. In terms of flow completion times, we show that Credence improves upon the state-of-the-art approaches by up to 95% using off-the-shelf machine learning techniques that are also practical in today's hardware. We believe this work opens several interesting future work opportunities both in systems and theory that we discuss at the end of this paper.
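To make the drop-tail versus push-out distinction concrete, here is a minimal sketch of the two baselines named in the abstract, Complete Sharing and LQD, on a shared buffer. It is a simplification under stated assumptions: packets are unit-sized, no packets depart during the burst, and ties between equally long queues go to the lowest port index. It does not model Credence or its predictions.

```python
def complete_sharing(arrivals, capacity, num_ports):
    """Drop-tail: accept a packet whenever the shared buffer has space."""
    queues = [0] * num_ports
    dropped = 0
    for port in arrivals:
        if sum(queues) < capacity:
            queues[port] += 1
        else:
            dropped += 1  # buffer full: the arriving packet is dropped
    return queues, dropped


def longest_queue_drop(arrivals, capacity, num_ports):
    """Push-out: always accept, then evict from the longest queue on overflow."""
    queues = [0] * num_ports
    dropped = 0
    for port in arrivals:
        queues[port] += 1
        if sum(queues) > capacity:
            longest = max(range(num_ports), key=lambda p: queues[p])
            queues[longest] -= 1  # push out one packet of the longest queue
            dropped += 1
    return queues, dropped


# Port 0 bursts first and fills the buffer, then port 1 starts sending.
arrivals = [0, 0, 0, 0, 1, 1]
cs_queues, cs_dropped = complete_sharing(arrivals, capacity=4, num_ports=2)
lqd_queues, lqd_dropped = longest_queue_drop(arrivals, capacity=4, num_ports=2)
```

Both policies drop two packets in this burst, but Complete Sharing leaves port 1 with nothing buffered while LQD rebalances the buffer between the two ports. Since ports drain in parallel, the balanced occupancy keeps both links busy, which is the intuition behind the better guarantees of push-out algorithms.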