Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference
This work addresses efficiency challenges in GPU-based inference for MoE models, which are increasingly important in large language models. The contribution appears incremental, since it builds on existing batching and MoE concepts.
The paper tackles the problem of executing irregular workloads on GPUs by proposing a static batching framework with runtime task mapping, and applies it to Mixture-of-Experts (MoE) model inference, achieving up to 91% of peak Tensor Core throughput on NVIDIA H800 and 95% on H20 GPUs.
Arranging and executing irregular workloads on massively parallel devices has long been a problem. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on the NVIDIA H800 GPU and 95% on the NVIDIA H20 GPU.
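
To make the idea of runtime task mapping concrete, the following is a minimal host-side sketch in Python, not the paper's CUDA kernel. It assumes that each expert's token batch is split into fixed-height row tiles and that a flat thread-block index is mapped back to an (expert, tile) pair through a prefix sum over per-expert tile counts. The names `build_tile_prefix`, `map_block_to_task`, and `tile_m` are hypothetical, and the abstract does not state that the authors use this particular lookup.

```python
from bisect import bisect_right
from itertools import accumulate


def build_tile_prefix(tokens_per_expert, tile_m):
    """Return inclusive prefix sums of the number of row tiles per expert.

    tile_m is the (assumed) number of token rows handled by one thread block.
    """
    tiles = [(n + tile_m - 1) // tile_m for n in tokens_per_expert]  # ceil division
    return list(accumulate(tiles))


def map_block_to_task(block_id, tile_prefix):
    """Map a flat block index to (expert_id, tile_id_within_expert).

    The first expert whose inclusive prefix exceeds block_id owns the block,
    so experts that received zero tokens are skipped automatically.
    """
    expert = bisect_right(tile_prefix, block_id)
    start = tile_prefix[expert - 1] if expert > 0 else 0
    return expert, block_id - start


# Irregular workload: four experts with very different token counts.
tokens_per_expert = [5, 0, 130, 64]
tile_prefix = build_tile_prefix(tokens_per_expert, tile_m=64)

# One launch covers every tile; each "block" finds its own task at run time.
schedule = [map_block_to_task(b, tile_prefix) for b in range(tile_prefix[-1])]
```

In this sketch the grid size equals the total tile count, so no block is launched for the expert that received no tokens, and all experts share a single launch instead of one launch per expert. On a GPU, each thread block would perform the same lookup using its block index.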