LG · CL · May 27

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

arXiv:2605.27788
Predicted impact: top 63% in LG · last 90 days
Originality: highly original
AI Analysis

For LLM practitioners, CARL provides a practical method to teach models when to use tools without step-level supervision, especially benefiting smaller models.

CARL improves LLM tool use by learning segment-level credit assignment from binary outcomes. It gains 6.7 exact-match points at 7B and 9.7 at 3B over the best RL baseline, issues 53% fewer tool calls on parametrically answerable questions, and its critic separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B.

Humans know when to reach for help: $347 \times 28$ warrants a calculator, while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit can neither isolate which tool call in a successful episode actually helped nor penalize unnecessary calls. We propose \textbf{CARL} (\textbf{C}ompetence-\textbf{A}ware \textbf{R}einforcement \textbf{L}earning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where external help is needed. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match (EM) accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on MuSiQue. The model issues 53\% fewer tool calls on parametrically answerable questions while remaining ${\sim}10$ EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.
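The segment-level idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the advantage estimator, so the code below assumes a simple one-step value difference, in which each segment is credited with the change in the critic's estimated success probability across it and the final segment is compared against the binary outcome. The names `Segment`, `split_rollout`, and `segment_advantages` are hypothetical, and the critic values are hand-picked numbers rather than outputs of a trained model.

```python
import re
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # code fence delimiter, treated here as a tool-use boundary


@dataclass
class Segment:
    kind: str  # "text" (model reasoning) or "tool" (fenced tool call)
    text: str


def split_rollout(rollout: str) -> List[Segment]:
    """Split a rollout at code-fence boundaries (hypothetical helper)."""
    fenced = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
    segments: List[Segment] = []
    pos = 0
    for match in fenced.finditer(rollout):
        before = rollout[pos:match.start()].strip()
        if before:
            segments.append(Segment("text", before))
        segments.append(Segment("tool", match.group(0)))
        pos = match.end()
    tail = rollout[pos:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


def segment_advantages(values: List[float], outcome: float) -> List[float]:
    """Assumed estimator: advantage_i = V(after segment i) - V(before segment i).

    values[i] is the critic's success estimate *before* segment i; the value
    after the last segment is the observed binary outcome (0.0 or 1.0).
    """
    targets = values[1:] + [outcome]
    return [after - before for before, after in zip(values, targets)]


rollout = (
    "I need to compute 347 * 28.\n"
    + FENCE + "python\nprint(347 * 28)\n" + FENCE
    + "\nThe answer is 9716."
)
segments = split_rollout(rollout)
# Hand-picked critic values, one per segment, for illustration only.
advantages = segment_advantages([0.2, 0.3, 0.9], outcome=1.0)
for seg, adv in zip(segments, advantages):
    print(f"{seg.kind:>4}  advantage={adv:+.2f}")
```

Under this assumed estimator the per-segment advantages telescope to the outcome minus the initial value estimate, so a single binary reward is distributed across segments rather than copied to each one. A tool call after which the critic's estimate drops (for instance, an unnecessary call on a question the model could answer parametrically) receives a negative advantage even when the episode as a whole succeeds.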
