Differentiable Kernel Ridge Regression for Deep Learning Pipelines
This work provides a practical way to combine kernel methods with deep learning, offering new design flexibility for practitioners, though the improvements are incremental over existing approaches.
The authors introduce Sparse Kernels (SKs), a differentiable and localized variant of kernel ridge regression that can be integrated as modular layers in deep learning pipelines. Across CNNs, ViTs, and RL, SK modules either match trained neural readouts with substantially less training or improve existing models when added as extra components. The design also enables training-free transfer, nonlinear probing, and hybrid kernel-neural models.
Deep neural networks dominate modern machine learning, while alternative function approximators remain comparatively underexplored at scale. In this work, we revisit kernel methods as drop-in components for standard deep learning pipelines. We introduce Sparse Kernels (SKs), a differentiable, localized, and lazy variant of kernel ridge regression (KRR) that defers training to inference time and reduces to the solution of small local systems. We integrate SKs into PyTorch as modular layers that preserve end-to-end trainability, and we show that they expose three distinct sets of parameters (feature representations, target values, and evaluation points), each of which can be fixed or learned. This decomposition broadens the design space available to practitioners, enabling, in particular, training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, vision transformers, and reinforcement learning, SK-based modules serve two complementary roles: in some settings, they match the performance of trained neural readouts with substantially less training; in others, they augment existing models and improve their performance when used as additional components. Our results suggest that kernel methods, once made scalable and differentiable, can be readily integrated with deep learning rather than treated as a separate paradigm.
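To make the idea of a lazy, localized, differentiable KRR layer concrete, the following is a minimal PyTorch sketch and not the authors' implementation. It assumes a Gaussian kernel and k-nearest-neighbour localization, neither of which is specified in the abstract. The names `local_krr_predict` and `LocalKRRReadout` and the parameters `k`, `lengthscale`, and `ridge` are hypothetical. For each evaluation point it solves one small k x k ridge system at inference time, and gradients can flow to the features, the targets, and the evaluation points.

```python
import torch


def sq_dists(a, b):
    # Pairwise squared Euclidean distances: (..., p, d) x (..., r, d) -> (..., p, r).
    return ((a.unsqueeze(-2) - b.unsqueeze(-3)) ** 2).sum(-1)


def local_krr_predict(queries, feats, targets, k=8, lengthscale=1.0, ridge=1e-3):
    """Illustrative local KRR (assumed Gaussian kernel, kNN neighbourhoods).

    queries: (m, d) evaluation points
    feats:   (n, d) stored feature representations
    targets: (n, c) stored target values
    """
    # Neighbour selection is discrete, so it is done without tracking gradients.
    with torch.no_grad():
        idx = sq_dists(queries, feats).topk(k, dim=1, largest=False).indices  # (m, k)
    nb_x, nb_y = feats[idx], targets[idx]  # (m, k, d), (m, k, c)

    scale = 2.0 * lengthscale ** 2
    K = torch.exp(-sq_dists(nb_x, nb_x) / scale)                  # (m, k, k)
    k_q = torch.exp(-sq_dists(queries.unsqueeze(1), nb_x) / scale)  # (m, 1, k)

    # One small (k x k) ridge system per query instead of a global (n x n) solve.
    eye = torch.eye(k, dtype=K.dtype, device=K.device)
    alpha = torch.linalg.solve(K + ridge * eye, nb_y)             # (m, k, c)
    return (k_q @ alpha).squeeze(1)                               # (m, c)


class LocalKRRReadout(torch.nn.Module):
    """Hypothetical layer wrapper: features and targets can be fixed or learned."""

    def __init__(self, feats, targets, learn_feats=False, learn_targets=False,
                 **solver_kwargs):
        super().__init__()
        self.feats = torch.nn.Parameter(feats.clone(), requires_grad=learn_feats)
        self.targets = torch.nn.Parameter(targets.clone(), requires_grad=learn_targets)
        self.solver_kwargs = solver_kwargs

    def forward(self, queries):
        return local_krr_predict(queries, self.feats, self.targets,
                                 **self.solver_kwargs)
```

In this sketch, leaving both flags at `False` corresponds to a training-free readout over frozen features, while setting `learn_targets=True` or `learn_feats=True` lets an optimizer update the stored targets or features through the local solves. The evaluation points receive gradients whenever the `queries` tensor comes from an upstream network.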