CL · Mar 23

Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

arXiv:2603.21454 · 19.8 · h-index: 1

Predicted impact top 88% in CL · last 90 days · Originality Highly original

AI Analysis

This work addresses credibility problems in LLM coding benchmarks, which matter to both researchers and practitioners, and offers a detection method that is novel rather than incremental.

The paper tackles benchmark contamination in LLM coding benchmarks by introducing Cross-Context Verification (CCV) and the Hierarchical Cross-Context Architecture (HCCA), and reports perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p≈0.012).

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker → Verifier → Director) yields a negative result, 100% sycophantic confirmation, providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
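The CCV idea described above (solve the same problem in N independent sessions, score how much the solutions differ, then compare the scores of suspected-contaminated and genuine problems with a Mann-Whitney U test) can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the diversity metric (fraction of differing solution pairs after whitespace normalization), the function names, the 3-versus-6 group split, and every score in the usage section are assumptions made up for the example. The abstract does not state which diversity measure, group sizes, or test variant were used.

```python
from itertools import combinations
from math import comb


def solution_diversity(solutions):
    """Fraction of solution pairs that differ (0.0 = every session identical).

    Hypothetical metric: whitespace-normalized exact match between pairs.
    """
    normalized = [" ".join(s.split()) for s in solutions]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj; ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_one_sided_p(x, y):
    """P(U <= observed) under the null, enumerating every relabeling of the pool."""
    pooled = list(x) + list(y)
    n1 = len(x)
    observed = mann_whitney_u(x, y)
    hits = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if mann_whitney_u(gx, gy) <= observed:
            hits += 1
    return hits / comb(len(pooled), n1)


# Illustrative, invented per-problem diversity scores (NOT data from the paper).
contaminated = [0.0, 0.0, 0.0]            # identical output in every session
genuine = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # sessions disagree to varying degrees

u = mann_whitney_u(contaminated, genuine)
p = exact_one_sided_p(contaminated, genuine)
print(f"U = {u}, exact one-sided p = {p:.4f}")
```

With these invented scores every contaminated problem ranks below every genuine one, so U is 0 and only one of the 84 possible relabelings is as extreme as the observed one. A real replication would substitute the diversity measure and grouping from the released code and data.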
