Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models
This work addresses data scarcity in large language models by analyzing sample efficiency, which offers guidance for model selection in resource-constrained settings. The contribution is incremental, since it builds on existing double descent theory.
The study compares the sample efficiency of discrete diffusion and autoregressive models through the lens of the double descent phenomenon. It finds that autoregressive models are more sample-efficient on small datasets, while discrete diffusion models need larger capacity and more compute to become competitive.
Data scarcity drives the need for more sample-efficient large language models. In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. We show that discrete diffusion models require larger capacity and more training epochs to escape their underparameterized regime and reach the interpolation threshold. In the strongly overparameterized regime, both model families behave similarly, and neither shows a pronounced second descent in test loss across a large range of model sizes. Overall, our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.
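
To make the two quantities in the abstract concrete, the following minimal Python sketch shows one way the interpolation threshold and a "pronounced second descent" could be read off a sweep over model sizes. It is an illustration under stated assumptions, not the authors' procedure: the function names, the training-loss tolerance `tol`, the `min_drop` criterion, and all loss values are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def interpolation_threshold(
    sizes: Sequence[int],
    train_losses: Sequence[float],
    tol: float = 1e-2,
) -> Optional[int]:
    """Return the smallest model size whose training loss is at most `tol`.

    Assumption: "reaching the interpolation threshold" is approximated by
    the training loss falling below a small tolerance.
    """
    for size, loss in sorted(zip(sizes, train_losses)):
        if loss <= tol:
            return size
    return None  # the sweep never left the underparameterized regime


def has_pronounced_second_descent(
    sizes: Sequence[int],
    test_losses: Sequence[float],
    threshold: int,
    min_drop: float = 0.1,
) -> bool:
    """Check whether test loss falls by more than `min_drop` after its peak.

    Only model sizes at or beyond `threshold` are considered. The drop is
    measured from the peak test loss to the lowest test loss at any size
    at or beyond that peak.
    """
    past = [
        loss
        for size, loss in sorted(zip(sizes, test_losses))
        if size >= threshold
    ]
    if len(past) < 2:
        return False
    peak_idx = max(range(len(past)), key=past.__getitem__)
    return past[peak_idx] - min(past[peak_idx:]) > min_drop


# Hypothetical sweep (illustrative numbers only, not results from the paper).
sizes = [1, 2, 4, 8, 16, 32]
ar_train = [2.0, 1.0, 0.3, 0.005, 0.001, 0.001]    # autoregressive
dd_train = [2.5, 1.8, 1.0, 0.4, 0.008, 0.002]      # discrete diffusion
ar_test = [3.0, 2.8, 3.1, 3.4, 3.38, 3.37]

ar_threshold = interpolation_threshold(sizes, ar_train)
dd_threshold = interpolation_threshold(sizes, dd_train)
ar_second_descent = has_pronounced_second_descent(sizes, ar_test, ar_threshold)
```

In this made-up sweep the discrete diffusion curve crosses the tolerance at a larger model size than the autoregressive one, and the autoregressive test loss stays nearly flat past its threshold, which mirrors the qualitative pattern the abstract describes. A real analysis would also need to sweep training epochs and average over seeds.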