Pretraining on the Test Set Is All You Need
This work addresses the challenge of efficient and effective language model training for academic evaluation, though it appears incremental because it focuses on data curation rather than novel architectural changes.
The authors tackle the problem of achieving high performance in language models by pretraining on a small, curated dataset of evaluation benchmarks. The resulting 1 million parameter model achieves perfect results across diverse academic benchmarks and outperforms all known foundation models.
Inspired by recent work demonstrating the promise of smaller Transformer-based language models pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high-quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of fewer than 100 thousand tokens, we pretrain a 1 million parameter Transformer-based LLM \textbf{phi-CTNL} (pronounced ``fictional'') that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. \textbf{phi-CTNL} also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks' canaries.