LGAIITMLJun 4, 2024

Can Kernel Methods Explain How the Data Affects Neural Collapse?

arXiv:2406.02105v34 citationsHas Code
Originality Incremental advance
AI Analysis

This work addresses the limitation of existing theoretical models in explaining how data affects neural collapse for researchers in deep learning theory, though it is incremental as it highlights gaps rather than providing a complete solution.

The paper tackled the problem of analyzing Neural Collapse (NC1) in neural networks using kernel methods, showing that data-independent kernels like NTK fail to capture NC behavior, while a data-aware kernel reduces NC1 but does not fully match shallow network trends.

A vast amount of literature has recently focused on the "Neural Collapse" (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network's deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. To address this limitation of UFMs, this paper explores the possibility of analyzing NC1 using kernels associated with shallow NNs. We begin by formulating an NC1 metric as a function of the kernel. Then, we specialize it to the NN Gaussian Process kernel (NNGP) and the Neural Tangent Kernel (NTK), associated with wide networks at initialization and during gradient-based training with a small learning rate, respectively. As a key result, we show that the NTK does not represent more collapsed features than the NNGP for Gaussian data of arbitrary dimensions. This showcases the limitations of data-independent kernels such as NTK in approximating the NC behavior of NNs. As an alternative to NTK, we then empirically explore a recently proposed data-aware Gaussian Process kernel, which generalizes NNGP to model feature learning. We show that this kernel yields lower NC1 than NNGP but may not follow the trends of the shallow NN. Our study demonstrates that adaptivity to data may allow kernel-based analysis of NC, though further advancements in this area are still needed. A nice byproduct of our study is showing both theoretically and empirically that the choice of nonlinear activation function affects NC1 (with ERF yielding lower values than ReLU). The code is available at: https://github.com/kvignesh1420/shallow_nc1

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes