CL · Mar 23

Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

arXiv:2603.21454 · 19.8 · h-index: 1
Predicted impact top 88% in CL · last 90 days · Originality: Highly original
AI Analysis

This work addresses credibility problems in LLM coding benchmarks, which affect both researchers and practitioners, and offers a detection method that is novel rather than incremental.

The paper tackles contamination in LLM coding benchmarks by introducing Cross-Context Verification (CCV) and the Hierarchical Cross-Context Architecture (HCCA). CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012).
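To make the reported statistic concrete, here is a minimal sketch of an exact Mann-Whitney test on perfectly separated groups. The listing does not give the group sizes, the per-problem scores, or whether the test was one-sided, so the 3-versus-6 split and the score values below are assumptions chosen purely for illustration. Under those assumptions, U = 0 gives a one-sided exact p of 1/84, which is close to the reported 0.012, and a rank-biserial effect size of 1.0.

```python
from itertools import combinations
from math import comb


def mann_whitney_u(xs, ys):
    """U statistic for xs: count of (x, y) pairs with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def exact_one_sided_p(xs, ys):
    """P(U <= observed U) under the null, enumerating every relabeling."""
    pooled = list(xs) + list(ys)
    n1 = len(xs)
    observed = mann_whitney_u(xs, ys)
    hits = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if mann_whitney_u(a, b) <= observed:
            hits += 1
    return hits / comb(len(pooled), n1)


def rank_biserial(u, n1, n2):
    """Rank-biserial effect size; magnitude 1.0 means complete separation."""
    return 1.0 - 2.0 * u / (n1 * n2)


# Hypothetical diversity scores (NOT from the paper): 3 contaminated, 6 genuine.
contaminated = [0.00, 0.02, 0.05]
genuine = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

u = mann_whitney_u(contaminated, genuine)
p = exact_one_sided_p(contaminated, genuine)
r = rank_biserial(u, len(contaminated), len(genuine))
print(u, round(p, 4), r)
```

With the same assumed split, a two-sided exact test would give 2/84, about 0.024, so the reported value depends on details the listing does not state.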

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary--models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result--100% sycophantic confirmation--providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
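The core CCV idea, solving one problem in N independent sessions and measuring how much the solutions differ, can be sketched as follows. The abstract does not say how diversity is measured, so the mean pairwise dissimilarity from `difflib.SequenceMatcher` is an assumption made for illustration, not the authors' metric. The function name `solution_diversity` and the sample patches are likewise hypothetical. Five sessions per problem is inferred from 45 trials over 9 problems.

```python
from difflib import SequenceMatcher
from itertools import combinations


def solution_diversity(solutions):
    """Mean pairwise dissimilarity (1 - similarity ratio) across sessions.

    Assumed metric: 0.0 means every session produced the same text;
    larger values mean the sessions diverged.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Hypothetical outputs from five isolated sessions on one problem.
recalled = ["if value is None:\n    return default\n"] * 5
reasoned = [
    "if value is None:\n    return default\n",
    "return default if value is None else value\n",
    "value = value or default\nreturn value\n",
    "try:\n    return value.get()\nexcept AttributeError:\n    return default\n",
    "return value if value is not None else default\n",
]

recalled_score = solution_diversity(recalled)
reasoned_score = solution_diversity(reasoned)
print(recalled_score, reasoned_score > recalled_score)
```

On this reading, near-zero diversity across isolated sessions is what would point to recall, while divergent solutions would point to genuine reasoning. Where to place a threshold between the two is not given in the abstract.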

