AICP · Apr 28

ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

arXiv:2604.25224 · 33.9
AI Analysis

For AI-finance evaluation, ValueAlpha provides a pre-calibration metrology layer to prevent reporting of unreliable LLM-judged claims, addressing the problem of unvalidated judges that may reward verbosity or rubric mimicry.

ValueAlpha introduces a preregistered agreement-gated stress-test protocol to determine when LLM-judged investment rationales are reliable enough to report, before returns are observable. In a controlled prototype with 1,100 trajectories, the protocol cleared the aggregate agreement gate at a mean κ_w = 0.7168 but flagged several failures, including a per-dimension gate failure (constraint_awareness, mean κ_w = 0.2022) and a -2.81 rubric-point penalty for terse-correct rationales relative to honest ones.

Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at \(\bar{\kappa}_w = 0.7168\) but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, \(\bar{\kappa}_w = 0.2022\)), single-judge rankings are family-dependent, and terse-correct rationales receive a \(\Delta = -2.81\) rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.
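To make the agreement gate concrete, the following is a minimal sketch of how a mean pairwise weighted kappa could be computed across judges and compared to a gate threshold. It is not the paper's implementation: the quadratic weighting, the 1-5 rubric scale, the function names, and the 0.6 threshold are all illustrative assumptions, since the abstract reports only the resulting \(\bar{\kappa}_w\) values and which gates passed or failed.

```python
from itertools import combinations
from statistics import mean


def weighted_kappa(a, b, categories, power=2):
    """Cohen's weighted kappa between two raters' ordinal scores.

    power=2 gives quadratic weights (an assumption; the abstract does
    not state which weighting scheme ValueAlpha uses).
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint distribution of (rater A score, rater B score).
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1.0 / n

    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return (abs(i - j) / (k - 1)) ** power

    d_obs = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        # Degenerate case: both raters used one identical category.
        return 1.0
    return 1.0 - d_obs / d_exp


def mean_pairwise_kappa(scores_by_judge, categories):
    """Average weighted kappa over all unordered pairs of judges."""
    pairs = combinations(sorted(scores_by_judge), 2)
    return mean(
        weighted_kappa(scores_by_judge[p], scores_by_judge[q], categories)
        for p, q in pairs
    )


def gate(kappa_bar, threshold=0.6):
    """Hypothetical gate: the 0.6 threshold is an assumed placeholder."""
    return "pass" if kappa_bar >= threshold else "fail"


if __name__ == "__main__":
    scale = [1, 2, 3, 4, 5]  # assumed rubric scale
    judges = {  # toy scores, not data from the paper
        "judge_a": [5, 4, 3, 2, 1, 4],
        "judge_b": [5, 4, 2, 2, 1, 5],
        "judge_c": [4, 4, 3, 1, 1, 4],
    }
    kappa_bar = mean_pairwise_kappa(judges, scale)
    print(f"mean kappa_w = {kappa_bar:.4f} -> {gate(kappa_bar)}")
```

Under any threshold between the two reported values, the abstract's aggregate figure (0.7168) would pass and the constraint_awareness figure (0.2022) would fail, which matches the reported outcomes; the same function could be applied per rubric dimension to reproduce the per-dimension gate in spirit.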

