LG · May 12

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

arXiv:2605.11599 · 16.2

Predicted impact: top 83% in LG · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers evaluating LLM reasoning, this work provides a methodological framework for separating genuine model errors from perturbation and extraction artifacts, though its score-based sampler (CAPS) shows no improvement over equal-budget uniform sampling.

The paper proposes an audit-constrained protocol for evaluating LLM reasoning under prompt variation, in which a mismatch counts as a model error only after the perturbation is confirmed semantically valid and the answer extraction is confirmed sound. Across three audited slices the protocol identifies confirmed model errors, but matched comparisons show that the score-based sampler (CAPS) does not improve audited yield or unique prompt-key discovery over uniform sampling.

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
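The protocol described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three-slot grammar, the toy addition tasks, the stub model, the naive extractor, the audit rule, and in particular the scoring rule in `score_sampler`, which is a generic smoothed mismatch-rate heuristic and not the paper's CAPS. The sketch only shows the shape of a budget-matched comparison in which raw mismatches and audited errors are tallied separately.

```python
import random

# Hypothetical finite component grammar: each slot has a fixed set of values.
SLOTS = {
    "instruction": ["Solve:", "Compute:", "Find the value of"],
    "number_style": ["digits", "words"],
    "answer_format": ["plain", "boxed"],
}
SLOT_NAMES = sorted(SLOTS)
TASKS = {"t1": (3, 4), "t2": (10, 5), "t3": (7, 8)}  # toy addition task bank
WORDS = {3: "three", 4: "four", 5: "five", 7: "seven", 8: "eight", 10: "ten"}


def render(task_id, choice):
    """Deterministic renderer: same task and components give the same prompt."""
    a, b = TASKS[task_id]
    if choice["number_style"] == "words":
        a, b = WORDS[a], WORDS[b]
    suffix = " Put the answer in \\boxed{}." if choice["answer_format"] == "boxed" else ""
    return f"{choice['instruction']} {a} + {b}.{suffix}"


def toy_model(task_id, choice):
    """Stub standing in for a real model call (assumption, not a real API)."""
    answer = sum(TASKS[task_id])
    if choice["number_style"] == "words" and task_id == "t3":
        answer -= 1  # simulated genuine reasoning error
    return f"\\boxed{{{answer}}}" if choice["answer_format"] == "boxed" else str(answer)


def naive_extract(output):
    """Deliberately brittle extractor: fails on boxed answers."""
    return int(output) if output.isdigit() else None


def audit(task_id, output):
    """Extraction audit: re-read the answer and confirm it is really wrong.
    Semantic validity holds by construction in this toy grammar."""
    digits = "".join(ch for ch in output if ch.isdigit())
    return digits != "" and int(digits) != sum(TASKS[task_id])


def uniform_sampler(rng, history):
    return {s: rng.choice(SLOTS[s]) for s in SLOT_NAMES}


def score_sampler(rng, history):
    """Hypothetical score-based sampler: weight each component value by its
    smoothed raw-mismatch rate so far. Not the paper's CAPS scoring rule."""
    choice = {}
    for s in SLOT_NAMES:
        weights = []
        for v in SLOTS[s]:
            seen = [m for c, m in history if c[s] == v]
            weights.append((sum(seen) + 1) / (len(seen) + 2))
        choice[s] = rng.choices(SLOTS[s], weights=weights, k=1)[0]
    return choice


def run(sampler, budget, seed):
    """Spend a fixed query budget; count raw mismatches and audited errors."""
    rng = random.Random(seed)
    history, keys, raw, audited = [], set(), 0, 0
    task_ids = sorted(TASKS)
    for _ in range(budget):
        task_id = rng.choice(task_ids)
        choice = sampler(rng, history)
        prompt = render(task_id, choice)  # what a real model would receive
        output = toy_model(task_id, choice)
        mismatch = naive_extract(output) != sum(TASKS[task_id])
        raw += mismatch
        history.append((choice, mismatch))
        if mismatch and audit(task_id, output):
            audited += 1
            keys.add((task_id,) + tuple(choice[s] for s in SLOT_NAMES))
    return {"raw": raw, "audited": audited, "keys": keys}


uniform = run(uniform_sampler, budget=60, seed=0)
scored = run(score_sampler, budget=60, seed=0)
for name, r in (("uniform", uniform), ("scored", scored)):
    print(name, "raw:", r["raw"], "audited:", r["audited"], "unique keys:", len(r["keys"]))
```

In this toy setup every boxed answer produces a raw mismatch because the naive extractor cannot parse it, yet only the simulated error on `t3` with spelled-out numbers survives audit. A sampler that chases raw mismatches can therefore drift toward the boxed format without finding more confirmed errors, which is one reason to compare policies on audited yield and unique prompt keys under the same budget.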
