LG · Nov 26, 2024

An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

arXiv:2411.17182v19.24 citations · h-index: 1 · NIPS

Originality: Incremental advance

AI Analysis

This work targets the understanding and improvement of generalization in deep neural networks. It is an incremental advance because it builds on prior work on SRR and CRATE.

The paper investigates whether the Sparse Rate Reduction (SRR) objective is optimized in practice and whether it is causally linked to generalization in Transformer-like models. It finds that SRR correlates positively with generalization and outperforms baseline measures such as path-norm and sharpness-based ones. It also shows that using SRR as a regularizer improves generalization on benchmark image classification datasets.

Deep neural networks have long been criticized as black boxes. To unveil the inner workings of modern neural architectures, recent work \cite{yu2024white} proposed an information-theoretic objective function called Sparse Rate Reduction (SRR) and interpreted its unrolled optimization as a Transformer-like model called Coding Rate Reduction Transformer (CRATE). However, that study focused primarily on the basic implementation; whether this objective is optimized in practice, and how it relates causally to generalization, remain elusive. Going beyond this study, we derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. Surprisingly, we find that SRR has a positive correlation coefficient and outperforms other baseline measures, such as path-norm and sharpness-based ones. Furthermore, we show that generalization can be improved using SRR as regularization on benchmark image classification datasets. We hope this paper can shed light on leveraging SRR to design principled models and study their generalization ability.
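To make "evaluating a complexity measure based on its correlation with generalization" concrete, here is a minimal sketch. It assumes Kendall's rank correlation (tau-a) as the coefficient, which is a common choice in studies of generalization measures; the abstract does not say which coefficient the paper uses. The function name `kendall_tau` and all numeric values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(measure, gap):
    """Kendall rank correlation (tau-a) between two equal-length sequences.

    Assumption: this coefficient is used purely for illustration; the source
    abstract only mentions "a correlation coefficient".
    """
    if len(measure) != len(gap) or len(measure) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(measure)), 2):
        # The sign of the product tells whether the pair is ordered the
        # same way (concordant) or the opposite way (discordant).
        s = (measure[i] - measure[j]) * (gap[i] - gap[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(measure) * (len(measure) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five model variants (NOT results from the paper):
# one complexity-measure value and one generalization gap per variant.
measure_values = [0.8, 1.1, 1.5, 1.9, 2.4]
generalization_gaps = [0.05, 0.07, 0.06, 0.10, 0.12]

tau = kendall_tau(measure_values, generalization_gaps)
print(f"Kendall tau = {tau:.2f}")
```

A value of tau near +1 means the measure ranks the model variants almost the same way the generalization gap does, a value near -1 means the ranking is reversed, and a value near 0 means the measure carries little ranking information.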
