When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
This paper addresses the reliability crisis in deploying small language models as trustworthy autonomous agents and shows that standard accuracy metrics are insufficient, though its contribution of a new verification method is incremental.
The paper tackles the problem of small language models (7-9B parameters) producing correct answers through flawed reasoning, a 'Right-for-Wrong-Reasons' phenomenon affecting 50-69% of correct answers. It introduces the Reasoning Integrity Score (RIS), a process-based metric, and shows that retrieval-augmented generation improves reasoning integrity while meta-cognitive interventions often harm it.
Deploying small language models (7-9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69\% of correct answers from these models contain fundamentally flawed reasoning -- a ``Right-for-Wrong-Reasons'' phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement ($\kappa=0.657$). Our findings challenge conventional practices: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen's $d=0.23$--$0.93$), meta-cognitive interventions like self-critique often harm performance ($d=-0.14$ to $-0.33$) in small models on the evaluated tasks. Mechanistic analysis reveals that RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6\%, while meta-cognition amplifies confusion without sufficient model capacity. To enable deployment, we distill verification capabilities into a neural classifier achieving 0.86 F1-score with 100$\times$ speedup. These results underscore the necessity of process-based verification for trustworthy agents: accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons.
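For readers unfamiliar with the two statistics quoted above, the following is a minimal sketch of how an agreement coefficient and an effect size of this kind are commonly computed. It assumes the two-rater Cohen's kappa and the pooled-standard-deviation form of Cohen's $d$; the abstract does not say which variants the authors used. The function names and the toy labels and scores are hypothetical and are not data from the paper.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters labeling the same items.

    Assumption: two-rater Cohen's kappa; the abstract only reports "kappa".
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


def cohens_d(treatment, control):
    """Standardized mean difference with the pooled sample standard deviation.

    Assumption: pooled-SD variant; each group needs at least two scores.
    """
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical toy inputs, for illustration only.
labels_a = ["sound", "sound", "flawed", "flawed"]
labels_b = ["sound", "flawed", "flawed", "flawed"]
scores_with_rag = [0.6, 0.7, 0.8]
scores_baseline = [0.5, 0.6, 0.7]

print(f"kappa = {cohens_kappa(labels_a, labels_b):.3f}")
print(f"d = {cohens_d(scores_with_rag, scores_baseline):.3f}")
```

Under these definitions, a positive $d$ means the intervention group scored higher on average than the baseline, and a negative $d$ means it scored lower, which matches the sign convention in the RAG and self-critique results quoted above.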