Samip Dahal

h-index: 1
2 papers

2 Papers

8.0 · LG · Jun 3
q0: Primitives for Hyper-Epoch Pretraining

Bishwas Mandal, Shmuel Berman, Akshay Vegesna et al.

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach a cumulative ~12.9x data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
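As a rough illustration of the first primitive, the sketch below shows one way a cyclic schedule could make learning rate and weight decay move in opposite directions. The abstract does not give the schedule's shape or values, so the cosine form, the function name `cyclic_lr_wd`, and every default range here are assumptions, not the authors' recipe.

```python
import math

def cyclic_lr_wd(step, cycle_len, lr_min=1e-4, lr_max=1e-3,
                 wd_min=0.01, wd_max=0.1):
    """Return (lr, wd) for a training step (hypothetical helper).

    Assumed shape: within each cycle the learning rate follows a cosine
    decay from lr_max to lr_min, while weight decay rises from wd_min
    toward wd_max, so the two are anti-correlated.
    """
    phase = (step % cycle_len) / cycle_len        # position in cycle, in [0, 1)
    w = 0.5 * (1.0 + math.cos(math.pi * phase))   # 1 at cycle start, near 0 at cycle end
    lr = lr_min + (lr_max - lr_min) * w           # high early, low late
    wd = wd_min + (wd_max - wd_min) * (1.0 - w)   # low early, high late
    return lr, wd

# Example: inspect a few points of one 100-step cycle.
for s in (0, 25, 50, 75, 99):
    lr, wd = cyclic_lr_wd(s, cycle_len=100)
    print(f"step={s:3d}  lr={lr:.6f}  wd={wd:.4f}")
```

Under this assumed shape, a population member would be saved near the end of each cycle, where the learning rate is lowest; that checkpointing choice is likewise an assumption.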

LG · Sep 13, 2025
Decoupling Search and Learning in Neural Net Training

Akshay Vegesna, Samip Dahal

Gradient descent typically converges to a single minimum of the training loss without mechanisms to explore alternative minima that may generalize better. Searching for diverse minima directly in high-dimensional parameter space is generally intractable. To address this, we propose a framework that performs training in two distinct phases: search in a tractable representation space (the space of intermediate activations) to find diverse representational solutions, and gradient-based learning in parameter space by regressing to those searched representations. Through evolutionary search, we discover representational solutions whose fitness and diversity scale with compute: larger populations and more generations produce better and more varied solutions. These representations prove to be learnable: networks trained by regressing to searched representations approach SGD's performance on MNIST, CIFAR-10, and CIFAR-100. Performance improves with search compute up to saturation. The resulting models differ qualitatively from networks trained with gradient descent, following different representational trajectories during training. This work demonstrates how future training algorithms could overcome gradient descent's exploratory limitations by decoupling search in representation space from efficient gradient-based learning in parameter space.
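The two-phase idea can be sketched on a toy problem. Everything below is an illustrative assumption, not the paper's method: the fitness function (negative error of a linear readout), the mutation scheme, and the function names are hypothetical, and a closed-form least-squares fit stands in for the gradient-based learning phase.

```python
import numpy as np

def fitness(H, Y):
    """Assumed fitness: negative MSE of the best linear readout from H to Y."""
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return -float(np.mean((H @ W - Y) ** 2))

def evolve_representations(Y, dim=8, pop_size=16, generations=20,
                           sigma=0.1, seed=0):
    """Phase 1 (search): evolve candidate activation matrices of shape (n, dim)."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    pop = [rng.normal(size=(n, dim)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda H: fitness(H, Y), reverse=True)
        history.append(fitness(pop[0], Y))
        parents = pop[: pop_size // 2]              # elitism: keep the top half
        children = [p + sigma * rng.normal(size=p.shape) for p in parents]
        pop = parents + children                    # mutated copies refill the population
    pop.sort(key=lambda H: fitness(H, Y), reverse=True)
    return pop[0], history

def regress_to_representation(X, H):
    """Phase 2 (learning): fit parameters mapping inputs X to the searched H.

    Least squares is used here as a stand-in for gradient-based training.
    """
    W, *_ = np.linalg.lstsq(X, H, rcond=None)
    return W

# Toy usage with random inputs and 4-class one-hot labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 10))
Y = np.eye(4)[rng.integers(0, 4, size=64)]
H_best, history = evolve_representations(Y)
W_in = regress_to_representation(X, H_best)
print("best fitness, first vs. last generation:", history[0], history[-1])
```

Because the top half of each generation is carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.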