LGAILOApr 15

SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

arXiv:2604.1351513.9h-index: 1
Predicted impact top 37% in LG · last 90 daysOriginality Incremental advance
AI Analysis

For practitioners of post-training for autoformalization, this work identifies data overlap as a critical hyperparameter, showing that disjoint data consistently outperforms overlapping data at no extra cost.

This paper investigates the effect of data overlap between supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) stages in post-training for Lean 4 autoformalization. Keeping SFT and GRPO data disjoint yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while full overlap renders GRPO redundant.

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes