31.5 NI May 19
SKYLINK: Scalable and Resilient Link Management in LEO Satellite Network
Wanja de Sombre, Arash Asadi, Debopam Bhattacherjee et al.
The rapid growth of space-based services has established LEO satellite networks as a promising option for global broadband connectivity. Next-generation LEO networks leverage inter-satellite links (ISLs) to provide faster and more reliable communications compared to traditional bent-pipe architectures, even in remote regions. However, the high mobility of satellites, dynamic traffic patterns, and potential link failures pose significant challenges for efficient and resilient routing. To address these challenges, we model the LEO satellite network as a time-varying graph comprising a constellation of satellites and ground stations. Our objective is to minimize a weighted sum of average delay and packet drop rate. Each satellite independently decides how to distribute its incoming traffic to neighboring nodes in real time. Given the infeasibility of finding optimal solutions at scale, due to the exponential growth of routing options and uncertainties in link capacities, we propose SKYLINK, a novel fully distributed learning strategy for link management in LEO satellite networks. SKYLINK enables each satellite to adapt to the time-varying network conditions, ensuring real-time responsiveness, scalability to millions of users, and resilience to network failures, while maintaining low communication overhead and computational complexity. To support the evaluation of SKYLINK at global scale, we develop a new simulator for large-scale LEO satellite networks. For 25.4 million users, SKYLINK reduces the weighted sum of average delay and drop rate by 29% compared to the bent-pipe approach, and by 92% compared to Dijkstra. It lowers drop rates by 95% relative to k-shortest paths, 99% relative to Dijkstra, and 74% compared to the bent-pipe baseline, while achieving up to 46% higher throughput. At the same time, SKYLINK maintains constant computational complexity with respect to constellation size.
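As a rough illustration of the objective named in the abstract above, the following Python sketch computes a weighted sum of average delay and packet drop rate. The function name, the weights, and the units are assumptions made for illustration only; the abstract does not specify weights or normalization, and this is not the authors' implementation.

```python
def weighted_objective(delays_ms, sent, dropped, w_delay=0.5, w_drop=0.5):
    """Weighted sum of average delay and packet drop rate.

    delays_ms: delays of delivered packets (milliseconds, assumed unit)
    sent:      number of packets sent
    dropped:   number of packets dropped
    w_delay, w_drop: hypothetical weights, not taken from the paper
    """
    avg_delay = sum(delays_ms) / len(delays_ms) if delays_ms else 0.0
    drop_rate = dropped / sent if sent else 0.0
    return w_delay * avg_delay + w_drop * drop_rate


if __name__ == "__main__":
    # Two delivered packets with 10 ms and 30 ms delay, 5 of 100 packets dropped.
    print(weighted_objective([10.0, 30.0], sent=100, dropped=5,
                             w_delay=1.0, w_drop=100.0))
```

Because delay and drop rate live on different scales, the choice of weights determines how many milliseconds of delay are traded against one percentage point of loss.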
56.8 DC May 22
XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms
Tella Rajashekhar Reddy, Atharva Deshmukh, Liangcheng Yu et al.
AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.
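To make the idea of routing on real-time signals concrete, here is a minimal Python sketch of a reactive router that picks the least-loaded site from the three signals the abstract lists: inference latency, KV-cache utilization, and queue depth. The `SiteSignals` type, the `pick_site` function, the normalization constants, and the max-of-signals rule are all hypothetical; the abstract does not describe how XWind combines these signals.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    name: str
    latency_ms: float     # recent inference latency
    kv_cache_util: float  # fraction of KV cache in use, in [0, 1]
    queue_depth: int      # requests waiting at the site


def pick_site(sites, latency_budget_ms=2000.0, max_queue=64):
    """Return the name of the site whose worst normalized signal is lowest.

    latency_budget_ms and max_queue are assumed normalization constants.
    """
    def load(site):
        return max(
            site.latency_ms / latency_budget_ms,
            site.kv_cache_util,
            site.queue_depth / max_queue,
        )

    return min(sites, key=load).name


if __name__ == "__main__":
    sites = [
        SiteSignals("site-a", latency_ms=1000.0, kv_cache_util=0.9, queue_depth=10),
        SiteSignals("site-b", latency_ms=500.0, kv_cache_util=0.4, queue_depth=32),
    ]
    print(pick_site(sites))
```

Taking the maximum of the normalized signals treats whichever resource is closest to saturation as the bottleneck; a weighted sum would be an equally plausible choice.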
55.7 IM Apr 24 Code
CosmicDancePro -- Measuring LEO satellite's orbital decay and network connectivity implications during solar storms
Suvam Basak, Amitangshu Pal, Debopam Bhattacherjee
The May 2024 solar superstorm highlighted the vulnerability of rapidly expanding low Earth orbit (LEO) satellite networks to severe space weather events. To systematically evaluate LEO network resilience, we introduce an open-source tool, CosmicDancePro. It enables a comprehensive analysis of the effects of solar storms on LEO satellite networks. It integrates real-world multimodal datasets, including space weather measurements from several satellites, upper-atmospheric density conditions from data-driven and high-fidelity physics-based models, and LEO satellite trajectory and LEO network measurement traces, to quantify both the orbital decay driven by enhanced atmospheric density and the degradation of network connectivity. We use CosmicDancePro to analyze the Starlink constellation's behavior during two recent major solar storms. First, we identify the specific fleet management strategies Starlink adopted during the May 2024 solar superstorm and how they differ from its regular orbit-correction strategy. Second, we identify the mechanisms driving the previously unexplained 'W'-shaped altitude variation pattern across orbital planes of LEO constellations. Finally, our network-layer analysis quantifies the connectivity degradation during these storms, revealing transient disruptions that include repetitive short-lived outages, reconfiguration latency spikes above 500 ms, an increase of up to 60% in uplink loss, distorted diurnal latency patterns, and a 10+ Mbps drop in end-user data rates during storm peaks.
DC Nov 16, 2024 Code
Improving training time and GPU utilization in geo-distributed language model training
Palak, Tella Rajashekhar Reddy, Bhaskar Kataria et al.
The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is a challenge due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via a wide-area network (WAN). We built Atlas, which speeds up training using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas improves the training time, it does not completely eliminate the bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (part of LM inference) during the bubbles, thus improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code will be open-sourced.
DC Oct 17, 2025
BeLLMan: Controlling LLM Congestion
Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon et al.
Large language model (LLM) applications are blind to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inflated inferencing latency and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (up to 8x lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
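One way to picture progressive output-length signaling is the Python sketch below, which maps a load estimate to a token budget that the application could request. The function name, the linear ramp, the threshold, and the token limits are all assumptions for illustration; the abstract does not state the control law beLLMan uses.

```python
def output_token_budget(load, max_tokens=512, min_tokens=64, threshold=0.7):
    """Return a suggested output-length cap for a given system load.

    load:      hypothetical load estimate, where 1.0 means saturated
    threshold: load below which no shortening is requested (assumed value)
    The budget shrinks linearly from max_tokens to min_tokens between
    threshold and 1.0, and stays at min_tokens beyond that.
    """
    if load <= threshold:
        return max_tokens
    frac = min((load - threshold) / (1.0 - threshold), 1.0)
    return int(round(max_tokens - frac * (max_tokens - min_tokens)))


if __name__ == "__main__":
    for load in (0.5, 0.85, 1.0, 1.5):
        print(load, output_token_budget(load))
```

A gradual ramp, rather than a hard cutoff, matches the abstract's description of signaling "progressively" as load changes.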
DC May 15, 2025
AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron
Tella Rajashekhar Reddy, Palak, Rohan Gandhi et al.
AI power demand is growing at an unprecedented rate thanks to the high power density of AI compute and the emerging inferencing workload. On the supply side, abundant wind power is waiting for grid access in interconnection queues. In this light, this paper argues for bringing AI workload to modular compute clusters co-located in wind farms. Our deployment right-sizing strategy makes it economically viable to deploy more than 6 million high-end GPUs today that could consume cheap, green power at its source. We built Heron, a cross-site software router, that could efficiently leverage the complementarity of power generation across wind farms by routing AI inferencing workload around power drops. Using 1 week of coding and conversation production traces from Azure and (real) variable wind power traces, we show how Heron improves aggregate goodput of AI compute by up to 80% compared to the state-of-the-art.