CV · Dec 12, 2021

Improving Vision Transformers for Incremental Learning

arXiv:2112.06103v3 · 18 citations
Originality: Incremental advance
AI Analysis

This work addresses incremental learning for vision tasks, but the contribution itself is an incremental advance: it combines existing techniques to improve ViT performance.

The paper tackles the problem of applying Vision Transformers (ViTs) to class incremental learning, where naive use leads to performance degradation, and proposes ViTIL, which achieves new state-of-the-art results on CIFAR and ImageNet datasets by clear margins.

This paper proposes a working recipe for using Vision Transformers (ViTs) in class incremental learning. Although this recipe only combines existing techniques, developing the combination is not trivial. First, naively replacing convolutional neural networks (CNNs) with ViT in incremental learning results in serious performance degradation. Second, we identify three issues with naively using ViT: (a) ViT converges very slowly when the number of classes is small, (b) ViT shows more bias towards new classes than CNN-based architectures, and (c) the conventional learning rate of ViT is too low to learn a good classifier layer. Finally, our solution, named ViTIL (ViT for Incremental Learning), achieves new state-of-the-art results on both CIFAR and ImageNet datasets for all three class incremental learning setups by a clear margin. We believe this advances the knowledge of transformers in the incremental learning community. Code will be publicly released.
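
Issue (b) in the abstract, bias towards new classes, can be made concrete with a simple diagnostic. The sketch below is illustrative only and is not taken from the paper: the function name `new_class_bias`, the convention that class ids below `num_old_classes` are old classes, and the toy data are all assumptions. It measures the fraction of old-class samples that a model assigns to a class from the latest incremental step.

```python
from typing import Sequence


def new_class_bias(
    labels: Sequence[int],
    predictions: Sequence[int],
    num_old_classes: int,
) -> float:
    """Fraction of old-class samples that are predicted as a new class.

    Assumption (not from the paper): class ids 0..num_old_classes-1 are
    old classes, and any id >= num_old_classes was added in the most
    recent incremental step.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    old_total = 0   # number of samples whose true class is an old class
    old_to_new = 0  # old-class samples predicted as one of the new classes
    for y, p in zip(labels, predictions):
        if y < num_old_classes:
            old_total += 1
            if p >= num_old_classes:
                old_to_new += 1

    # With no old-class samples there is nothing to measure.
    if old_total == 0:
        return 0.0
    return old_to_new / old_total


# Toy data: classes 0-1 are old, classes 2-3 were added in the latest step.
labels = [0, 0, 1, 1, 2, 3]
predictions = [0, 2, 3, 1, 2, 3]
bias = new_class_bias(labels, predictions, num_old_classes=2)
print(bias)
```

A value near 0 would suggest old classes are rarely confused with new ones, while a value near 1 would suggest the classifier is strongly skewed towards the newly added classes. Comparing this number between a ViT and a CNN backbone trained under the same protocol is one possible way to observe the effect the abstract describes.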

