Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
This work provides a robust environment and methods to improve the training of LLMs for generating high-performance GPU kernels, which is significant for developers and researchers in scalable AI systems.
This paper addresses challenges in training large language models (LLMs) for high-quality kernel generation, specifically reward hacking and lazy optimization. The authors develop KernelGYM, a distributed GPU environment, and propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation in multi-turn reinforcement learning. With their model, Dr Kernel-14B, 31.6% of generated kernels on the KernelBench Level-2 subset achieve at least a 1.2x speedup over the Torch reference, outperforming Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%).
High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
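The self-inclusion issue mentioned in the abstract can be illustrated with a small sketch. It is not the paper's TRLOO implementation: the turn-level formulation is not given in this summary, the function names and reward values are hypothetical, and the standard-deviation normalization that GRPO normally applies is left out. The sketch only contrasts a group-mean baseline, which includes each sample's own reward, with a generic leave-one-out baseline, which uses only the other samples in the group.

```python
from statistics import mean


def group_mean_advantages(rewards):
    """Baseline is the mean of ALL samples in the group, including the sample itself."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def leave_one_out_advantages(rewards):
    """Baseline for sample i is the mean of the OTHER samples in the group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same prompt and turn.
rewards = [1.0, 0.0, 0.0, 0.0]

grpo_style = group_mean_advantages(rewards)     # self-inclusive baseline
rloo_style = leave_one_out_advantages(rewards)  # leave-one-out baseline

for r, g, l in zip(rewards, grpo_style, rloo_style):
    print(f"reward={r:.1f}  group-mean adv={g:+.4f}  leave-one-out adv={l:+.4f}")
```

For a group of n samples, the group-mean advantage equals the leave-one-out advantage multiplied by (n - 1)/n, so including a sample in its own baseline shrinks its advantage, and the effect is largest when groups are small.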