It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

arXiv:2605.1212917.1

Predicted impact: top 83% in SE · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners deploying small language models, this work demonstrates that operational stability depends more on harness engineering than model size, though findings are limited to 2-3B parameter models and specific tasks.

The paper shows that harness design significantly impacts small language model performance, with a 4-stage pipeline achieving TSR=0.952 and VTSR=1.000 on Gemma4 E2B, while minimal-shell can underperform raw prompts due to scaffold collapse.
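The two metrics can be illustrated with a minimal sketch. It assumes TSR is the fraction of all tasks that succeed and VTSR is the fraction of successes among runs whose output passed format validation; the summary does not spell out either definition, so both formulas are assumptions. The `RunResult` type and the 21-run example counts are hypothetical, chosen only because they are consistent with the reported figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    """Hypothetical record for a single task run."""
    valid_format: bool  # output matched the required structure (e.g. parsed as JSON)
    success: bool       # output was also judged correct for the task


def tsr(results: List[RunResult]) -> float:
    """Assumed definition: successful runs divided by all runs."""
    if not results:
        return 0.0
    return sum(r.valid_format and r.success for r in results) / len(results)


def vtsr(results: List[RunResult]) -> float:
    """Assumed definition: successful runs divided by format-valid runs."""
    valid = [r for r in results if r.valid_format]
    if not valid:
        return 0.0
    return sum(r.success for r in valid) / len(valid)


# Hypothetical breakdown: 21 runs, 20 valid and correct, 1 format violation.
example = [RunResult(True, True)] * 20 + [RunResult(False, False)]
print(f"TSR={tsr(example):.3f} VTSR={vtsr(example):.3f}")
```

Under these assumed definitions, a single format violation lowers TSR but leaves VTSR untouched, which is one way a run set can report TSR below 1 alongside VTSR=1.000.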

This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon, in which minimal-shell TSR falls below model-only TSR, is observed in two models. In the LLaMA 3.2 3B model-only condition, seven format violations yield TSR=0.429, revealing scaffold collapse: without harness support, the model abandons JSON structure under complex format requirements. Ablation shows that planning and recovery each contribute approximately 24.7% of the total gain. The Verification Catch Rate (VCR) is 0.625 across all pipeline runs.
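The plan->execute->verify->recover loop can be sketched roughly as follows. This is not the paper's implementation: the prompts, the `run_pipeline` name, the `model` callable, and the `max_recoveries` parameter are all hypothetical, and verification is reduced here to checking that the output parses as a JSON object, which is only an assumption about what the verify stage checks.

```python
import json
from typing import Callable, Optional


def run_pipeline(
    task: str,
    model: Callable[[str], str],  # hypothetical: maps a prompt to raw model text
    max_recoveries: int = 1,
) -> Optional[dict]:
    """Plan, execute, verify, and recover; returns parsed JSON or None."""
    # Stage 1: plan. Ask the model to outline its steps first.
    plan = model(f"List the steps needed to solve this task:\n{task}")

    # Stage 2: execute. Produce an answer conditioned on the plan.
    output = model(
        f"Task:\n{task}\n\nPlan:\n{plan}\n\nReply with a single JSON object."
    )

    for attempt in range(max_recoveries + 1):
        # Stage 3: verify. Here, only structural validity is checked.
        try:
            parsed = json.loads(output)
            if isinstance(parsed, dict):
                return parsed
            error = "top-level value is not a JSON object"
        except json.JSONDecodeError as exc:
            error = str(exc)

        if attempt == max_recoveries:
            break

        # Stage 4: recover. Feed the failure back and ask for a repair.
        output = model(
            f"Your previous reply was invalid ({error}).\n"
            f"Previous reply:\n{output}\n\nReply again with a single JSON object."
        )

    return None
```

A stub such as `lambda prompt: '{"answer": 4}'` can stand in for `model` when trying the loop without a real SLM. Returning `None` after the recovery budget is spent keeps format failures visible to whatever computes TSR, rather than hiding them behind a default value.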
