LG · May 1

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

Chaohao Yuan, Chenghao Xiao, Yu Rong, Hong Cheng, Long-Kai Huang

arXiv:2605.00610 · 94.8 · Has Code

AI Analysis

Provides a practical, low-cost method to integrate SFT and RLVR for LLMs, addressing a key challenge in post-training.

The SFT and RLVR post-training paradigms for LLMs have complementary strengths but are difficult to integrate because of magnitude disparity, sign interference, and heterogeneous module-wise updates. The proposed Decoupled Test-time Synthesis (DoTS) combines independently trained checkpoints at inference time via task vector arithmetic, matching or exceeding training-based integration methods on math reasoning benchmarks at only ~3% of the computational cost.

SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling along a different dimension: SFT expands knowledge breadth, while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30× magnitude disparity, 45% sign interference, and heterogeneous module-wise update distributions. These findings show that SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT-RLVR integration methods across multiple mathematical reasoning benchmarks while incurring only ~3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. Code is available at https://github.com/chaohaoyuan/DoTS.
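The task vector arithmetic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: parameters are flattened to plain Python lists, the function names (`task_vector`, `sparsify_keep_norm`, `synthesize`) are hypothetical, and keeping the largest-magnitude entries is an assumed selection rule, since the abstract does not state how DoTS chooses which entries to keep. The coefficients `alpha` and `beta` are set by hand here, whereas the abstract says they are found by Bayesian optimization.

```python
import math


def task_vector(finetuned, base):
    """Task vector: element-wise difference between a fine-tuned checkpoint and the base."""
    return [f - b for f, b in zip(finetuned, base)]


def l2_norm(v):
    """Euclidean norm of a flat parameter vector."""
    return math.sqrt(sum(x * x for x in v))


def sparsify_keep_norm(tau, keep_ratio):
    """Zero all but the largest-magnitude entries, then rescale to the original L2 norm.

    Magnitude-based top-k selection is an assumption made for this sketch.
    """
    k = max(1, int(round(keep_ratio * len(tau))))
    ranked = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(ranked[:k])
    sparse = [x if i in keep else 0.0 for i, x in enumerate(tau)]
    sparse_norm = l2_norm(sparse)
    if sparse_norm == 0.0:
        return sparse  # all-zero task vector: nothing to rescale
    scale = l2_norm(tau) / sparse_norm
    return [x * scale for x in sparse]


def synthesize(base, tau_sft, tau_rlvr, alpha, beta):
    """Merged parameters: base + alpha * tau_sft + beta * tau_rlvr."""
    return [b + alpha * s + beta * r for b, s, r in zip(base, tau_sft, tau_rlvr)]


if __name__ == "__main__":
    # Toy 4-parameter "model"; all values are made up for illustration.
    base = [0.0, 1.0, -1.0, 0.5]
    sft = [0.3, 1.0, -1.6, 0.5]
    rlvr = [0.0, 1.01, -1.0, 0.48]

    tau_sft = sparsify_keep_norm(task_vector(sft, base), keep_ratio=0.5)
    tau_rlvr = sparsify_keep_norm(task_vector(rlvr, base), keep_ratio=0.5)

    merged = synthesize(base, tau_sft, tau_rlvr, alpha=1.0, beta=1.0)
    print(merged)
```

Because the merge is plain arithmetic over stored checkpoints, trying a new pair of coefficients requires no gradient updates, which is consistent with the low computational cost the abstract reports.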
