CUCo: An Agentic Framework for Compute and Communication Co-design
CUCo addresses a bottleneck in large-scale distributed LLM systems by automating kernel development, though the contribution is incremental in that it builds on prior kernel-optimization work.
The paper tackles the labor-intensive, error-prone task of manually writing CUDA kernels that maximize GPU utilization in distributed LLM training and inference. It introduces CUCo, a training-free, agent-driven workflow that automatically generates kernels co-optimizing computation and communication, reducing end-to-end latency by up to 1.57× compared to state-of-the-art baselines.
Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.
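To make the headline figure concrete, the short sketch below converts a speedup factor into the equivalent latency and percentage reduction. It assumes that "reducing end-to-end latency by up to $1.57\times$" means a ratio of baseline latency to CUCo latency; the function names and the 100 ms baseline are hypothetical and chosen only for illustration, not taken from the paper.

```python
def latency_after_speedup(baseline_ms: float, speedup: float) -> float:
    """Latency remaining after applying a speedup factor.

    Assumes speedup = baseline latency / optimized latency.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_ms / speedup


def percent_reduction(speedup: float) -> float:
    """Percentage of the baseline latency removed by a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    # Hypothetical 100 ms baseline, used only to illustrate the arithmetic.
    baseline_ms = 100.0
    speedup = 1.57  # best-case factor reported in the abstract
    print(f"optimized latency: {latency_after_speedup(baseline_ms, speedup):.1f} ms")
    print(f"latency reduction: {percent_reduction(speedup):.1f} %")
```

Under this reading, the best-case $1.57\times$ corresponds to removing a little over a third of the baseline latency, and "up to" indicates that typical gains may be smaller.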