AIOct 23, 2025

Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks

arXiv:2510.21866v11 citationsh-index: 1
Originality Incremental advance
AI Analysis

This quantifies scaling failures for knowledge-intensive applications using OPT and Pythia architectures, informing resource allocation decisions, though it is incremental as it focuses on specific model families.

The study identified capability ceilings in autoregressive language models, showing that knowledge-intensive tasks like MMLU mathematics benchmarks see negligible accuracy improvement (remaining at 19-20%) despite a 31% reduction in cross-entropy loss, while procedural tasks scale conventionally.

We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B parameters, spanning 240 times scaling) reveals that knowledge retrieval tasks show negligible accuracy improvement despite smooth loss reduction. On MMLU mathematics benchmarks, accuracy remains flat at 19-20% (below 25% random chance) across all scales while cross-entropy loss decreases by 31%. In contrast, procedural tasks like arithmetic show conventional scaling where both metrics improve together. Attention intervention experiments reveal high sensitivity to perturbation: swapping attention patterns between models causes catastrophic performance collapse (complete accuracy loss) rather than graceful degradation. These measurements have immediate engineering implications: for knowledge-intensive applications using OPT and Pythia architectures, parameter scaling beyond 1-2B offers minimal accuracy gains despite continued loss improvement. Our findings quantify capability-specific scaling failures in these model families to inform resource allocation decisions. Whether these patterns reflect fundamental constraints of decoder-only architectures or implementation-specific limitations remains an open question requiring investigation across diverse architectural approaches.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes