Precise asymptotic analysis of Sobolev training for random feature models
This work addresses a theoretical gap in the understanding of Sobolev training for overparameterized models and gives machine learning practitioners guidance on when gradient data is worth using. The contribution is incremental in that it builds on existing random feature models.
The paper studies how Sobolev training, which fits both function and gradient data, affects the generalization error of overparameterized random feature models. Its main result is a precise asymptotic characterization showing that adding gradient data does not universally improve performance, and that the better training method depends on the degree of overparameterization.
Gradient information is widely useful and available in applications, and is therefore natural to include in the training of neural networks. Yet little is known theoretically about the impact of Sobolev training -- regression with both function and gradient data -- on the generalization error of highly overparameterized predictive models in high dimensions. In this paper, we obtain a precise characterization of this training modality for random feature (RF) models in the limit where the number of trainable parameters, input dimensions, and training data tend proportionally to infinity. Our model for Sobolev training reflects practical implementations by sketching gradient data onto finite dimensional subspaces. By combining the replica method from statistical physics with linearizations in operator-valued free probability theory, we derive a closed-form description for the generalization errors of the trained RF models. For target functions described by single-index models, we demonstrate that supplementing function data with additional gradient data does not universally improve predictive performance. Rather, the degree of overparameterization should inform the choice of training method. More broadly, our results identify settings where models perform optimally by interpolating noisy function and gradient data.
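As a rough illustration of the setup described in the abstract, the following NumPy sketch fits the second-layer weights of a random feature model by ridge regression on function values plus gradients sketched onto a k-dimensional subspace. It is not the paper's code. The tanh activation, the sin link function of the single-index target, the orthonormal sketch matrix V, the noise levels, the ridge penalty lam, the weighting grad_weight, and all function names are assumptions made for this example.

```python
import numpy as np

def fit_rf(X, y, G, W, V, lam=1e-3, grad_weight=1.0):
    """Ridge fit of the RF weights a on function values and sketched gradients.

    Model (assumed): f(x) = a . tanh(W x / sqrt(d)) / sqrt(N).
    X: (n, d) inputs, y: (n,) labels, G: (n, k) sketched gradient labels,
    W: (N, d) frozen random weights, V: (d, k) sketch directions.
    """
    n, d = X.shape
    N = W.shape[0]
    Z = X @ W.T / np.sqrt(d)                 # (n, N) pre-activations
    Phi = np.tanh(Z) / np.sqrt(N)            # features for function values
    dact = 1.0 - np.tanh(Z) ** 2             # derivative of tanh
    WV = W @ V / np.sqrt(d)                  # (N, k) projected first-layer rows
    # Psi[i, j, :] are the features of the directional derivative of f
    # along V[:, j] at X[i]; the model is linear in a, so this is exact.
    Psi = dact[:, None, :] * WV.T[None, :, :] / np.sqrt(N)   # (n, k, N)
    s = np.sqrt(grad_weight)
    A = np.vstack([Phi, s * Psi.reshape(-1, N)])
    b = np.concatenate([y, s * G.reshape(-1)])
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ b)

def predict(X, W, a):
    d, N = X.shape[1], W.shape[0]
    return np.tanh(X @ W.T / np.sqrt(d)) @ a / np.sqrt(N)

rng = np.random.default_rng(0)
d, N, n, k = 20, 200, 100, 3                 # N > n: overparameterized
beta = rng.standard_normal(d)                # single-index direction (assumed)
W = rng.standard_normal((N, d))              # random features, never trained
V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal sketch

def sample(m, noise=0.1):
    X = rng.standard_normal((m, d))
    t = X @ beta / np.sqrt(d)
    y = np.sin(t) + noise * rng.standard_normal(m)
    # Sketched gradient of sin(beta . x / sqrt(d)) is cos(t) * V^T beta / sqrt(d).
    G = np.cos(t)[:, None] * (beta @ V)[None, :] / np.sqrt(d)
    G = G + noise * rng.standard_normal((m, k))
    return X, t, y, G

X, t, y, G = sample(n)
X_te, t_te, _, _ = sample(2000)

a_sobolev = fit_rf(X, y, G, W, V, grad_weight=1.0)
a_plain = fit_rf(X, y, G, W, V, grad_weight=0.0)   # function data only

err_sobolev = np.mean((predict(X_te, W, a_sobolev) - np.sin(t_te)) ** 2)
err_plain = np.mean((predict(X_te, W, a_plain) - np.sin(t_te)) ** 2)
print(f"test error, function + sketched gradients: {err_sobolev:.4f}")
print(f"test error, function data only:            {err_plain:.4f}")
```

Sweeping N / n and lam in a script like this gives a finite-size view of the comparison the abstract describes. Which of the two fits comes out ahead is not guaranteed and may change with the overparameterization ratio, the noise levels, and the penalty.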