CV · Mar 29, 2022

Fine-tuning Image Transformers using Learnable Memory

arXiv:2203.15243v2 · 57 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the need for parameter-efficient fine-tuning in computer vision, enabling a single model to handle multiple tasks with little additional computation. The advance is incremental, since it builds on existing transformer architectures.

The paper tackles the problem of adapting Vision Transformers to new tasks efficiently by augmenting them with learnable memory tokens, resulting in improved accuracy compared to head-only fine-tuning and performance close to full fine-tuning with fewer parameters.

In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks with computation reuse. In this setup, in addition to being parameter-efficient, models can execute both old and new tasks as part of a single inference at a small incremental cost.
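The two ideas in the abstract, per-layer memory tokens and attention masking, can be illustrated with a minimal single-head attention sketch in NumPy. This is not the authors' implementation. The weights are random stand-ins for a frozen pretrained layer, and all names (`attention`, `memory`, `new_cls`) are hypothetical. Two details are assumptions rather than statements from the abstract: memory tokens act only as extra keys and values, so the layer still emits one output per input token, and the mask layout blocks the original tokens from the new task's tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, wq, wk, wv, mask=None):
    """Single-head attention; mask[i, j] == False blocks query i from key j."""
    q = queries @ wq
    k = keys_values @ wk
    v = keys_values @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    return softmax(scores) @ v

rng = np.random.default_rng(0)
dim, n_tokens, n_mem = 8, 5, 3

# Stand-ins for frozen pretrained projection weights (assumption: random values).
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

tokens = rng.normal(size=(n_tokens, dim))  # class token plus patch embeddings
memory = rng.normal(size=(n_mem, dim))     # learnable memory tokens for this layer

# Baseline: the layer as pretrained, without memory.
base_out = attention(tokens, tokens, wq, wk, wv)

# Memory-augmented layer: input tokens attend over [tokens; memory].
# Memory tokens issue no queries here, so the output keeps n_tokens rows.
mem_out = attention(tokens, np.concatenate([tokens, memory]), wq, wk, wv)

# Masked extension: a new task adds its own class token and memory.
# Original tokens are blocked from the new tokens, so their outputs are reused
# unchanged, while the new class token attends to everything.
new_cls = rng.normal(size=(1, dim))
queries = np.concatenate([tokens, new_cls])
keys_values = np.concatenate([tokens, new_cls, memory])
mask = np.ones((n_tokens + 1, n_tokens + 1 + n_mem), dtype=bool)
mask[:n_tokens, n_tokens:] = False
masked_out = attention(queries, keys_values, wq, wk, wv, mask)

print(np.allclose(masked_out[:n_tokens], base_out))
```

In this sketch only `memory` and `new_cls` would be trained for the new task. The first `n_tokens` rows of `masked_out` match `base_out`, which is the sense in which the old task is preserved while both tasks share one forward pass.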

Code Implementations: 1 repo
