CL · May 27

SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

arXiv:2605.28179 · 80.7
Predicted impact: top 67% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners scaling LLMs, SuperValid provides a training-free metric for model selection and early stopping without benchmark evaluation, addressing generalization limitations of prior scaling law approaches.

SuperValid introduces a capability-aligned OOD validation loss that predicts downstream performance across diverse models and training distributions, achieving strong correlation with benchmark results across 17 benchmarks in 6 capability domains.

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work extends them to predict downstream benchmark performance. However, prior approaches generalize poorly in two respects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes out-of-distribution (OOD), capability-aligned validation data by distilling core concepts from the benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits a strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. Because it is a training-free metric that can be computed during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.
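
As a rough illustration of how a validation loss of this kind could be checked against capability-level performance, the sketch below averages benchmark scores within one capability domain, computes the Spearman rank correlation between that average and a per-checkpoint validation loss, and picks the checkpoint with the lowest loss. It is a minimal sketch, not the paper's procedure: the abstract does not state which correlation measure or aggregation is used, so Spearman correlation and simple averaging are assumptions, the names (capability_score, val_loss, bench_1, ckpt_a) are hypothetical, and all numbers are made-up placeholders rather than results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def capability_score(benchmark_scores):
    """Assumed aggregation: mean accuracy over one domain's benchmarks."""
    return mean(benchmark_scores.values())


# Placeholder checkpoints for a single capability domain (invented numbers).
checkpoints = [
    {"name": "ckpt_a", "val_loss": 2.9, "benchmarks": {"bench_1": 0.31, "bench_2": 0.27}},
    {"name": "ckpt_b", "val_loss": 2.6, "benchmarks": {"bench_1": 0.38, "bench_2": 0.34}},
    {"name": "ckpt_c", "val_loss": 2.4, "benchmarks": {"bench_1": 0.45, "bench_2": 0.41}},
    {"name": "ckpt_d", "val_loss": 2.5, "benchmarks": {"bench_1": 0.40, "bench_2": 0.40}},
]

losses = [c["val_loss"] for c in checkpoints]
scores = [capability_score(c["benchmarks"]) for c in checkpoints]

# A strongly negative value means lower loss goes with higher capability score.
rho = spearman(losses, scores)

# Model selection without running benchmarks: take the lowest validation loss.
best = min(checkpoints, key=lambda c: c["val_loss"])

print(f"Spearman rho = {rho:.2f}, selected checkpoint = {best['name']}")
```

With real data, the benchmark scores would only be needed once, to verify that the loss tracks capability; after that, selection and early stopping would rely on the loss alone.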

