Vision Transformers that Never Stop Learning
This work addresses loss of plasticity in continual learning for Vision Transformers, a fundamental obstacle to building AI systems that keep adapting to new tasks.
This paper investigates loss of plasticity in Vision Transformers (ViTs) and finds that stacked attention modules exhibit increasing instability, while feed-forward networks suffer even more pronounced degradation. The authors propose ARROW, a geometry-aware optimizer that adaptively reshapes gradient directions for the attention module, improving plasticity and maintaining better performance on new tasks.
Loss of plasticity refers to the progressive inability of a model to adapt to new tasks and poses a fundamental challenge for continual learning. While this phenomenon has been extensively studied in homogeneous neural architectures, such as multilayer perceptrons, its mechanisms in structurally heterogeneous, attention-based models such as Vision Transformers (ViTs) remain underexplored. In this work, we present a systematic investigation of loss of plasticity in ViTs, including a fine-grained diagnosis using local metrics that capture parameter diversity and utilization. Our analysis reveals that stacked attention modules exhibit increasing instability that exacerbates plasticity loss, while feed-forward network modules suffer even more pronounced degradation. Furthermore, we evaluate several approaches for mitigating plasticity loss. The results indicate that methods based on parameter re-initialization fail to recover plasticity in ViTs, whereas approaches that explicitly regulate the update process are more effective. Motivated by this insight, we propose ARROW, a geometry-aware optimizer that preserves plasticity by adaptively reshaping gradient directions using an online curvature estimate for the attention module. Extensive experiments show that ARROW effectively improves plasticity and maintains better performance on newly encountered tasks.
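To make the idea of reshaping gradient directions with an online curvature estimate more concrete, here is a minimal sketch of a generic diagonally preconditioned update. It is not ARROW: the abstract does not specify the curvature estimator or the update rule, so this sketch assumes an RMSProp-style running average of squared gradients as the curvature proxy. The function name `preconditioned_step` and the parameters `lr`, `beta`, and `eps` are hypothetical choices made for illustration only.

```python
import math


def preconditioned_step(params, grads, curv, lr=0.1, beta=0.9, eps=1e-8):
    """One update using a diagonal, online curvature proxy.

    Assumption (not from the paper): `curv` is an exponential moving
    average of squared gradients, as in RMSProp. Dividing each gradient
    coordinate by the square root of its proxy rescales coordinates
    individually, so the step direction differs from the raw gradient.
    """
    # Update the running curvature proxy for every coordinate.
    new_curv = [beta * c + (1.0 - beta) * g * g for c, g in zip(curv, grads)]
    # Take a step along the rescaled (reshaped) gradient.
    new_params = [
        p - lr * g / (math.sqrt(c) + eps)
        for p, g, c in zip(params, grads, new_curv)
    ]
    return new_params, new_curv


# Toy usage: two coordinates whose gradients differ by a factor of 100.
params = [1.0, 1.0]
grads = [100.0, 1.0]
curv = [0.0, 0.0]
new_params, new_curv = preconditioned_step(params, grads, curv)
```

In this toy call, plain gradient descent would move the first coordinate 100 times further than the second, whereas the preconditioned step moves both by nearly the same amount. That per-coordinate rescaling is the sense in which a curvature-aware update changes the direction, and not only the length, of the step.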