Weight Ensembling Improves Reasoning in Language Models
This work addresses a failure mode in training reasoning models and offers a simple intervention to improve performance. The contribution is incremental, since it builds on existing weight interpolation techniques.
The paper tackles diversity collapse during the training of reasoning models, where Pass@k deteriorates even as Pass@1 improves. It finds that weight ensembling (WiSE-FT) recovers Pass@k while also improving Pass@1, giving better test-time scaling and, when tuned further with reinforcement learning, superior results with less data.
We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and yields superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved through diversity-inducing decoding strategies alone, such as temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
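The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that checkpoints are plain dictionaries mapping parameter names to lists of floats (stand-ins for tensors), that the interpolation coefficient `alpha` is a free choice (0.5 here is only an example), and that the k samples for a problem are drawn independently, so a problem with Pass@1 rate p is solved within k attempts with probability 1 - (1 - p)^k. The function names `wise_ft` and `pass_at_k` are hypothetical. For k = 2 the dependence on the mean and variance of Pass@1 is exact: Pass@2 = 1 - (1 - E[p])^2 - Var(p), so two models with the same average Pass@1 can differ in Pass@2 purely because of how that rate is spread across problems.

```python
from statistics import fmean, pvariance


def wise_ft(early, late, alpha=0.5):
    """Linearly interpolate two checkpoints that share parameter names.

    `early` and `late` map names to lists of floats (hypothetical stand-ins
    for tensors). alpha=0 returns the early weights, alpha=1 the late ones.
    """
    if early.keys() != late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1 - alpha) * e + alpha * l
               for e, l in zip(early[name], late[name])]
        for name in early
    }


def pass_at_k(p_per_problem, k):
    """Expected Pass@k from per-problem Pass@1 rates, assuming i.i.d. samples."""
    return fmean(1 - (1 - p) ** k for p in p_per_problem)


# Weight interpolation on a toy two-parameter "model".
merged = wise_ft({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)

# Two toy models with the same mean Pass@1 (0.5) but different variance.
uniform = [0.5, 0.5, 0.5, 0.5]      # every problem solved half the time
collapsed = [1.0, 1.0, 0.0, 0.0]    # always right or always wrong

for rates in (uniform, collapsed):
    mean, var = fmean(rates), pvariance(rates)
    # Exact identity for k = 2: Pass@2 = 1 - (1 - mean)^2 - variance.
    assert abs(pass_at_k(rates, 2) - (1 - (1 - mean) ** 2 - var)) < 1e-12
    print(f"mean={mean:.2f} var={var:.2f} pass@2={pass_at_k(rates, 2):.2f}")
```

In this toy setting the collapsed model matches the uniform one on Pass@1 but has lower Pass@2, which is the kind of gap the bias-variance view describes. With real checkpoints the same interpolation would be applied tensor by tensor over the model's state dictionary.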