Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
For practitioners of LLM fine-tuning, the paper reveals that pre-SFT base behavior, not SFT recipe, determines absolute performance ceilings, and that procedural-skill gains are uniform across model sizes.
The paper measures procedural-skill SFT contribution across Qwen3.5 models (0.8B-4B) and finds uniform lift (~+0.04-0.075) but a W-shaped pre-SFT trajectory that dominates variation, with SFT working hardest where base models struggle. A format-compliance artifact biased earlier results, and cross-family validation with GPT-5.4 confirms findings.
We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. \textbf{Main finding.} Under matched-path LLM-only scoring, the SFT-attributable procedural-$Δ$ lift is roughly uniform across sizes: $+0.070$ / $+0.040$ / $+0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $Δ$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. \textbf{Methodology.} (i) A bench format-compliance artifact: 83.5\% of the holdout uses a deterministic \texttt{ANSWER}-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against \CU. (ii) A negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT \CU{} pass-rate within a 2\,pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. \textbf{Cross-family validation.} GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $κ\geq 0.754$, agreement $\geq 93.25\%$. Earlier ``format-only at 0.8B'' and ``shrinking SFT at 4B'' framings were path-mismatch artifacts; this paper supersedes both (Appendix~\ref{sec:appendix-path}). Single-seed; threats in §\ref{sec:threats}.