Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think
This paper addresses the theoretical and practical limitations of infinite-width models and offers machine learning researchers a new construction that improves their performance.
The paper challenges the belief that infinite-width models such as Neural Tangent Kernels (NTKs) underperform because they lack feature learning. It shows that NTKs can reproduce the same behavior by selecting subfeatures from an infinite frozen feature vector, and it introduces a new infinite-width limit based on ADAM-like dynamics that eliminates the performance gap with finite models.
Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This is usually attributed to the absence of feature learning. We show that this explanation is insufficient. Specifically, we show that infinite-width NTKs obviate the need for feature learning: they can learn identical behavior by selecting relevant subfeatures from their (infinite) frozen feature vector. Furthermore, we show experimentally that NTKs underperform traditional finite models even when feature learning is artificially disabled. Instead, we show that weak performance is at least partly due to the fact that existing constructions depend on weak optimizers like SGD. We provide a new infinite-width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.
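To make the optimizer argument concrete, here is a minimal sketch of why the choice between SGD-style and ADAM-style updates can matter when features are frozen. It is a toy illustration only, not the paper's infinite-width construction. When a model is linear in its trainable parameters, squared loss is quadratic in those parameters, so the sketch assumes a hypothetical two-parameter diagonal quadratic whose curvatures differ by a factor of 1000. The names `CURVATURES`, `TARGET`, `run_gd`, and `run_adam`, as well as the step counts and learning rates, are assumptions chosen for illustration.

```python
import math

# Hypothetical toy loss: 0.5 * sum_i h[i] * (w[i] - target[i])**2.
# The two curvatures stand in for a strongly and a weakly weighted
# frozen feature direction (an assumption, not taken from the paper).
CURVATURES = [1.0, 0.001]
TARGET = [1.0, 1.0]


def grad(w):
    """Gradient of the diagonal quadratic loss."""
    return [h * (wi - ti) for h, wi, ti in zip(CURVATURES, w, TARGET)]


def run_gd(steps, lr):
    """Plain full-batch gradient descent."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def run_adam(steps, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update with bias correction."""
    w = [0.0, 0.0]
    m = [0.0, 0.0]  # running mean of gradients
    v = [0.0, 0.0]  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Per-coordinate normalization makes the step size largely
        # independent of each coordinate's curvature.
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


def max_error(w):
    """Largest per-coordinate distance from the target."""
    return max(abs(wi - ti) for wi, ti in zip(w, TARGET))


gd_err = max_error(run_gd(steps=200, lr=1.0))
adam_err = max_error(run_adam(steps=200, lr=0.05))
print(f"GD max error:   {gd_err:.4f}")
print(f"Adam max error: {adam_err:.4f}")
```

In this sketch, gradient descent fits the high-curvature coordinate in a single step but shrinks the error in the low-curvature coordinate by only a factor of 0.999 per step, so most of that error remains after 200 steps. The Adam update rescales each coordinate by its own gradient magnitude and makes comparable progress on both. Whether the same mechanism accounts for the gap reported in the abstract is a question for the paper's experiments, not one this toy can settle.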