Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon
This work addresses a theoretical gap in data mixing methods for robust ML training and offers insights for practitioners, but it is incremental in that it builds on existing bilevel optimization frameworks.
The paper studies data mixing as a bilevel optimization problem and analyzes its convergence when only a finite number of inner update steps $T$ is taken between weight updates. It proves that too few steps (specifically $T=1$) can fail, and it derives how the optimal number of steps scales with the parameter update budget.
Data mixing, the strategic reweighting of training domains, is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta((N \log N)^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.
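The alternating scheme described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its failure example: the quadratic per-domain losses, the exponentiated-gradient weight update, the step sizes, and the name `run_data_mixing` are all assumptions chosen for illustration. The sketch takes $T$ inner gradient steps on the weighted training loss, then performs one weight update using a hypergradient surrogate evaluated at the current iterate instead of the exact inner minimizer, and repeats until the budget of $N$ inner steps is spent.

```python
import numpy as np

def run_data_mixing(centers, c_val, N, T, inner_lr=0.1, outer_lr=0.5):
    """Alternate T inner steps on theta with one weight update, using N inner steps in total.

    Assumed toy setup (not from the paper): domain k has loss
    f_k(theta) = 0.5 * ||theta - centers[k]||^2 and the validation loss is
    g(theta) = 0.5 * ||theta - c_val||^2.
    """
    K, d = centers.shape
    w = np.full(K, 1.0 / K)   # domain weights, kept on the probability simplex
    theta = np.zeros(d)       # model parameters, warm-started across rounds

    for _ in range(N // T):
        # Inner loop: T gradient steps on the weighted training loss.
        for _ in range(T):
            grad_theta = theta - w @ centers  # valid because the weights sum to 1
            theta = theta - inner_lr * grad_theta

        # Outer step: hypergradient surrogate at the current theta.
        # With the exact inner minimizer this matches the implicit-function
        # hypergradient for this toy problem; with finite T it is an approximation.
        grad_val = theta - c_val              # gradient of g at theta
        domain_grads = theta - centers        # row k is the gradient of f_k at theta
        hypergrad = -(domain_grads @ grad_val)

        # Exponentiated-gradient update keeps w positive and normalized.
        w = w * np.exp(-outer_lr * hypergrad)
        w = w / w.sum()

    val_loss = 0.5 * float(np.sum((theta - c_val) ** 2))
    return w, theta, val_loss

# Compare several choices of T under the same inner-step budget N.
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
c_val = np.array([0.0, 1.0])
for T in (1, 5, 20):
    w, theta, val_loss = run_data_mixing(centers, c_val, N=200, T=T)
    print(T, np.round(w, 3), val_loss)
```

Because the budget $N$ is fixed, a larger $T$ gives a more accurate inner solution per weight update but fewer weight updates overall (`N // T` of them), which is the trade-off the scaling results quantify. This toy problem is too benign to exhibit the $T=1$ failure mode or the $\Theta(\log N)$ scaling; it only shows the loop structure.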