Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
This addresses a specific bottleneck in LLM error detection for users relying on automated review, though it is an incremental improvement over existing methods.
The paper tackled the problem of large language models struggling to catch errors in their own outputs by introducing Cross-Context Review, a method where review is conducted in a fresh session without access to production history, resulting in an F1 score of 28.6% that outperformed other review conditions with statistical significance.
Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.