CL · AI · Jun 2

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv:2606.03650 · 72.2 · h-index: 6 · Has Code

Predicted impact: top 54% in CL · last 90 days · Originality: Highly original

AI Analysis

For practitioners needing to select models for specific applications without reliable benchmarks, CoEval provides a cheap, reusable, and contamination-free evaluation framework.

CoEval enables ranking language models for custom tasks without labeled data or trustworthy benchmarks by synthesizing contamination-free benchmarks from task descriptions and using a cross-family judge ensemble, recovering true rankings with high correlation (ρ=0.86) and costing only $5.89 for 7,978 evaluations.
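To make the ranking-recovery claim concrete, here is a minimal sketch of how an ensemble ranking could be compared against ground truth. It assumes ρ denotes Spearman's rank correlation and that the ensemble score is a plain mean over judges; the source states neither. The model names and all scores are hypothetical illustration data, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores from a three-judge panel (assumed data).
panel_scores = {
    "model_a": [0.82, 0.78, 0.80],
    "model_b": [0.64, 0.70, 0.61],
    "model_c": [0.71, 0.66, 0.75],
    "model_d": [0.40, 0.52, 0.45],
}
# Hypothetical ground-truth accuracy for the same models (assumed data).
ground_truth = {"model_a": 0.77, "model_b": 0.69, "model_c": 0.58, "model_d": 0.41}

models = sorted(panel_scores)
# Assumption: the ensemble score is the unweighted mean over judges.
ensemble = [mean(panel_scores[m]) for m in models]
truth = [ground_truth[m] for m in models]
rho = spearman_rho(ensemble, truth)
print(f"Spearman rho between ensemble and ground truth: {rho:.2f}")
```

In this toy data the ensemble swaps two adjacent models relative to ground truth, so the correlation is high but not perfect.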

Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists and standard public benchmarks cannot be trusted: their items have likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-source, reusable framework that closes this gap end to end. From only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels, contamination-free because items are generated anew on each run, and a cross-family judge ensemble ranks candidate models with no human raters. Validated where ground truth exists, CoEval recovers the true model ranking and tracks ground-truth correctness at ρ=0.86. The label-free judging needs no human calibration because judge-panel composition (vendor diversity), not size, drives reliability: a small, well-chosen cross-family panel is most reliable, while a single judge can be anti-correlated with ground truth (judge-choice regret 0.35) and the ensemble never is. Generated items show zero verbatim 13-gram overlap with five major public benchmarks; the panel cancels verbosity bias and precludes same-family self-preference. A four-task study produced 7,978 evaluations for USD 5.89. The same declarative pipeline applies to any domain and is cheap enough to re-run on every model release: a label-free, contamination-free leaderboard any team can regenerate for its own application.
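The verbatim 13-gram overlap check mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not CoEval's implementation: the abstract does not specify tokenization or normalization, so the sketch assumes lowercased whitespace tokens, and the function names `ngrams` and `flag_overlapping_items` are hypothetical.

```python
def ngrams(text, n=13):
    """Return the set of word n-grams in text.

    Assumption: lowercased whitespace tokenization; the source does not
    say how items are tokenized before comparison.
    """
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_overlapping_items(generated_items, benchmark_items, n=13):
    """Return generated items sharing at least one n-gram with any benchmark item."""
    benchmark_grams = set()
    for item in benchmark_items:
        benchmark_grams |= ngrams(item, n)
    return [item for item in generated_items if ngrams(item, n) & benchmark_grams]


# Hypothetical benchmark item and generated items (illustration only).
benchmark = [
    "a train leaves the station at nine and travels north at sixty miles "
    "per hour for three hours before stopping",
]
generated = [
    # Copies a long span of the benchmark item, so it should be flagged.
    "Question: a train leaves the station at nine and travels north at sixty "
    "miles per hour for three hours, how far does it go?",
    # Freshly worded item with no long shared span.
    "Summarize the refund policy below in two sentences for a customer "
    "who bought a laptop last week.",
]

flagged = flag_overlapping_items(generated, benchmark)
print(f"{len(flagged)} of {len(generated)} generated items share a 13-gram")
```

A result of zero flagged items corresponds to the "zero verbatim 13-gram overlap" reported above. Exact n-gram matching only detects verbatim copying, not paraphrased leakage.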
