CL | AI | LG | Oct 9, 2025

Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

arXiv:2510.08404v1 | h-index: 23 | Proceedings of the First BabyLM Workshop
Originality: Highly original
AI Analysis

This work targets the cost of language model pretraining and suggests that a shallow, efficient architecture may challenge prevailing deep learning scaling laws.

The paper tackles inefficient pretraining in language models by introducing Co$^4$, a tiny single-layer model with 8M parameters and approximately $O(N)$ cost. Trained for two epochs rather than ten, it outperforms the GPT-2 and GPT-BERT BabyLM Challenge baselines on most SuperGLUE zero-shot and fine-tuning metrics.

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both baselines are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
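To put the parameter counts and epoch counts quoted above side by side, here is a rough back-of-envelope sketch. It assumes a crude training-budget proxy of parameters × epochs × tokens. That proxy is not a measure used in the paper, and it ignores the $O(N)$ versus $O(N^2)$ difference in per-token cost, so it likely understates the gap the authors report. The names `MODELS`, `budget_proxy`, and `relative_budget` are hypothetical.

```python
# Figures quoted in the abstract: (parameter count, training epochs).
MODELS = {
    "Co4": (8_000_000, 2),
    "GPT-2": (124_000_000, 10),
    "GPT-BERT": (30_000_000, 10),
}

TOKENS = 10_000_000  # BabyLM 10M-token training set, as stated in the abstract


def budget_proxy(name: str) -> int:
    """Assumed proxy for training budget: parameters * epochs * tokens."""
    params, epochs = MODELS[name]
    return params * epochs * TOKENS


def relative_budget(name: str, reference: str = "Co4") -> float:
    """How many times larger a model's proxy budget is than the reference's."""
    return budget_proxy(name) / budget_proxy(reference)


def pairwise_to_linear_ratio(n_tokens: int) -> float:
    """Ratio of an N^2 operation count to an N operation count (equals N)."""
    return (n_tokens ** 2) / n_tokens


if __name__ == "__main__":
    for baseline in ("GPT-2", "GPT-BERT"):
        print(f"{baseline}: {relative_budget(baseline):.2f}x the Co4 proxy budget")
    print(f"N^2 / N at N=512: {pairwise_to_linear_ratio(512):.0f}")
```

Under this proxy alone, the baselines come out at roughly 78 and 19 times the Co$^4$ budget. Any further gap would have to come from per-token cost, which this sketch does not model.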
