DCJun 7, 2022
Tutel: Adaptive Mixture-of-Experts at ScaleChangho Hwang, Wei Cui, Yifan Xiong et al. · microsoft-research
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism that forwards each input token to the right sub-models or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer inefficient computation due to their static execution, namely static parallelism and pipelining, which does not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. Aggregating all techniques, Flex finally delivers huge speedup at any scale -- 4.96x and 5.75x speedup of a single MoE layer over 16 and 2,048 A100 GPUs, respectively, over the previous state-of-the-art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves superior accuracy in both pre-training and down-stream computer vision tasks such as COCO object detection than the counterpart dense model, indicating the readiness of Flex for end-to-end real-world model training and inference.
LGDec 24, 2022
An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable ContextXiaoyu Chen, Xiangming Zhu, Yufeng Zheng et al.
One of the key challenges in deploying RL to real-world applications is to adapt to variations of unknown environment contexts, such as changing terrains in robotic tasks and fluctuated bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the contexts are the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure, which existing works fail to address. To leverage the segment structure of piecewise stable context in real-world applications, in this paper, we propose a \textit{\textbf{Se}gmented \textbf{C}ontext \textbf{B}elief \textbf{A}ugmented \textbf{D}eep~(SeCBAD)} RL method. Our method can jointly infer the belief distribution over latent context with the posterior over segment length and perform more accurate belief context inference with observed data within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperform existing methods on a toy grid world environment and Mujuco tasks with piecewise-stable context.
83.1LGMay 8
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context InferenceFeiyu Yao, Zhixiong Niu, Xiaqing Li et al.
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well -- the worst average degradation is only -0.26 relative to FULL, while delivering 1.5$\times$-3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
ARMar 6
LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck AnalysisTao Zhang, Rui Ma, Shuotao Xu et al.
GPU design space exploration (DSE) for modern AI workloads, such as Large-Language Model (LLM) inference, is challenging because of GPUs' vast, multi-modal design spaces, high simulation costs, and complex design optimization objectives (e.g. performance, power and area trade-offs). Existing automated DSE methods are often prohibitively expensive, either requiring an excessive number of exploration samples or depending on intricate, manually crafted analyses of interdependent critical paths guided by human heuristics. We present LUMINA, an LLM-driven GPU architecture exploration framework that leverage AI to enhance the DSE efficiency and efficacy for GPUs. LUMINA extracts architectural knowledge from simulator code and performs sensitivity studies to automatically compose DSE rules,which are auto-corrected during exploration. A core component of LUMINA is a DSE Benchmark that comprehensively evaluates and enhances LLMs' capabilities across three fundamental skills required for architecture optimization, which provides a principled and reproducible basis for model selection and ensuring consistent architectural reasoning. In the design space with 4.7 million possible samples, LUMINA identifies 6 designs of better performance and area than an A100 GPU efficiently, using only 20 steps via LLM-assisted bottleneck analysis. In comparison, LUMINA achieves 17.5x higher than design space exploration efficiency, and 32.9% better designs (i.e. Pareto Hypervolume) than Machine-Learning baselines, showcasing its ability to deliver high-quality design guidance with minimal search cost.
NIAug 25, 2025
Automating Conflict-Aware ACL Configurations with Natural Language IntentsWenlong Ding, Jianqiang Li, Zhixiong Niu et al.
ACL configuration is essential for managing network flow reachability, yet its complexity grows significantly with topologies and pre-existing rules. To carry out ACL configuration, the operator needs to (1) understand the new configuration policies or intents and translate them into concrete ACL rules, (2) check and resolve any conflicts between the new and existing rules, and (3) deploy them across the network. Existing systems rely heavily on manual efforts for these tasks, especially for the first two, which are tedious, error-prone, and impractical to scale. We propose Xumi to tackle this problem. Leveraging LLMs with domain knowledge of the target network, Xumi automatically and accurately translates the natural language intents into complete ACL rules to reduce operators' manual efforts. Xumi then detects all potential conflicts between new and existing rules and generates resolved intents for deployment with operators' guidance, and finally identifies the best deployment plan that minimizes the rule additions while satisfying all intents. Evaluation shows that Xumi accelerates the entire configuration pipeline by over 10x compared to current practices, addresses O(100) conflicting ACLs and reduces rule additions by ~40% in modern cloud network.
DCMar 14, 2021
CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover MannerCheng Luo, Lei Qu, Youshan Miao et al.
Distributed deep learning workloads include throughput-intensive training tasks on the GPU clusters, where the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forces workers to wait for the gradient synchronization via a centralized parameter server or directly in decentralized workers. We present CrossoverScheduler, an algorithm that enables communication cycles of a distributed training application to be filled by other applications through pipelining communication and computation. With CrossoverScheduler, the running performance of distributed training can be significantly improved without sacrificing convergence rate and network accuracy. We achieve so by introducing Crossover Synchronization which allows multiple distributed deep learning applications to time-share the same GPU alternately. The prototype of CrossoverScheduler is built and integrated with Horovod. Experiments on a variety of distributed tasks show that CrossoverScheduler achieves 20% \times speedup for image classification tasks on ImageNet dataset.
DCFeb 17, 2020
Simulating Performance of ML Systems with Offline ProfilingHongming Huang, Peng Cheng, Hong Xu et al.
We advocate that simulation based on offline profiling is a promising approach to better understand and improve the complex ML systems. Our approach uses operation-level profiling and dataflow based simulation to ensure it offers a unified and automated solution for all frameworks and ML models, and is also accurate by considering the various parallelization strategies in a real system.
CRMar 19, 2019
BotGraph: Web Bot Detection Based on SitemapYang Luo, Guozhen She, Peng Cheng et al.
The web bots have been blamed for consuming large amount of Internet traffic and undermining the interest of the scraped sites for years. Traditional bot detection studies focus mainly on signature-based solution, but advanced bots usually forge their identities to bypass such detection. With increasing cloud migration, cloud providers provide new opportunities for an effective bot detection based on big data to solve this issue. In this paper, we present a behavior-based bot detection scheme called BotGraph that combines sitemap and convolutional neural network (CNN) to detect inner behavior of bots. Experimental results show that BotGraph achieves ~95% recall and precision on 35-day production data traces from different customers including the Bing search engine and several sites.
LGDec 27, 2018
Stanza: Layer Separation for Distributed Training in Deep LearningXiaorui Wu, Hong Xu, Bo Li et al.
The parameter server architecture is prevalently used for distributed deep learning. Each worker machine in a parameter server system trains the complete model, which leads to a hefty amount of network data transfer between workers and servers. We empirically observe that the data transfer has a non-negligible impact on training time. To tackle the problem, we design a new distributed training system called Stanza. Stanza exploits the fact that in many models such as convolution neural networks, most data exchange is attributed to the fully connected layers, while most computation is carried out in convolutional layers. Thus, we propose layer separation in distributed training: the majority of the nodes just train the convolutional layers, and the rest train the fully connected layers only. Gradients and parameters of the fully connected layers no longer need to be exchanged across the cluster, thereby substantially reducing the data transfer volume. We implement Stanza on PyTorch and evaluate its performance on Azure and EC2. Results show that Stanza accelerates training significantly over current parameter server systems: on EC2 instances with Tesla V100 GPU and 10Gb bandwidth for example, Stanza is 1.34x--13.9x faster for common deep learning models.