Debate Helps Weak Judges Reward Stronger Models

Ethan Elasky, Frank Nakasako, Naman Goyal

arXiv:2605.2748393.5

Predicted impact: top 7% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For AI alignment researchers seeking scalable oversight methods, this work identifies necessary conditions for debate to outperform simpler baselines and proposes a cheaper primitive (answer, critique, judge) for verifiable domains.

The paper studies proposer-critic debate for scalable oversight on verifiable code and logic tasks, finding that debate helps a weaker judge only when the critic's classification ability exceeds the judge's and the judge treats critic speeches as claims to verify. On three of five model pairings meeting these conditions, debate yields statistically significant gains over consultancy; on two non-responder pairings, debate produces null effects. Ablating rebuttal rounds shows no measurable change, suggesting a single independent critique recovers most of debate's benefit.

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.
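The pre-deployment audit described above (does the critic beat the judge, and will the judge verify it?) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' procedure: the function names, the two-proportion z-test, and both thresholds (`z_threshold`, `max_verification_drop`) are assumptions introduced here, and the abstract does not specify how "within noise" or the verification-rate drop were measured.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class AuditResult:
    critic_acc: float
    judge_acc: float
    z_score: float
    verification_drop: float
    debate_likely_helps: bool


def accuracy(predictions, labels):
    """Fraction of binary correct/incorrect calls that match ground truth."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def audit(critic_preds, judge_preds, labels,
          verify_rate_solo, verify_rate_with_critic,
          z_threshold=1.96, max_verification_drop=0.10):
    """Hypothetical audit: both thresholds are assumed, not from the paper.

    Check 1: the critic's classification accuracy exceeds the judge's by
             more than noise (simplified two-proportion z-test that treats
             the two samples as independent; a paired test such as
             McNemar's would fit same-item data better).
    Check 2: the judge's verification rate does not collapse once a
             critic enters the transcript.
    """
    n = len(labels)
    critic_acc = accuracy(critic_preds, labels)
    judge_acc = accuracy(judge_preds, labels)

    # Pooled standard error for two proportions with equal sample size n.
    pooled = (critic_acc + judge_acc) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (critic_acc - judge_acc) / se if se > 0 else 0.0

    drop = verify_rate_solo - verify_rate_with_critic
    helps = z > z_threshold and drop <= max_verification_drop
    return AuditResult(critic_acc, judge_acc, z, drop, helps)


if __name__ == "__main__":
    # Synthetic data for illustration only; not results from the paper.
    labels = [True, False] * 50
    critic = [(not y) if i < 10 else y for i, y in enumerate(labels)]
    judge = [(not y) if i < 40 else y for i, y in enumerate(labels)]
    print(audit(critic, judge, labels, 0.80, 0.78))
```

Both checks must pass for the sketch to predict a benefit, mirroring the abstract's claim that debate helps only when the critic has a usable advantage and the judge treats its speech as a claim to verify.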
