CV · Feb 20, 2025

Simpler Fast Vision Transformers with a Jumbo CLS Token

arXiv:2502.15021v2 · 3 citations · h-index: 6 · Has Code
Originality: Incremental advance
AI Analysis

This work targets the accuracy-throughput trade-off in vision transformers. It is an incremental advance whose gains are largest for small, high-speed models.

The paper aims to improve vision transformer (ViT) accuracy while maintaining throughput. It introduces Jumbo, a method that gives the CLS token a wider design and a dedicated FFN. Jumbo improves over ViT+Registers on ImageNet-1K and ImageNet-21K, with the largest gains at small model sizes: ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%.

We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
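The mechanism described in the abstract can be illustrated with a minimal, shape-level sketch in plain Python. It is not the authors' implementation (see the linked repository for that). The function names (`split_jumbo`, `merge_jumbo`, `jumbo_block`, `ffn_params`), the placement of the CLS pieces at the front of the sequence, the FFN expansion factor of 4, and the omission of normalization and residual connections are all assumptions made for illustration.

```python
def split_jumbo(jumbo, width):
    """Split a flat jumbo CLS vector into pieces that match the patch width."""
    if len(jumbo) % width != 0:
        raise ValueError("jumbo width must be a multiple of the patch width")
    return [jumbo[i:i + width] for i in range(0, len(jumbo), width)]


def merge_jumbo(pieces):
    """Reassemble the pieces into one wide jumbo CLS vector."""
    return [value for piece in pieces for value in piece]


def jumbo_block(patches, jumbo, attend, patch_ffn, jumbo_ffn):
    """One simplified layer: split, joint self-attention, reassemble, separate FFNs.

    `attend`, `patch_ffn`, and `jumbo_ffn` are caller-supplied callables
    (hypothetical stand-ins for real attention and MLP modules).
    """
    width = len(patches[0])
    pieces = split_jumbo(jumbo, width)
    # Assumption: the CLS pieces sit in front of the patch tokens.
    tokens = attend(pieces + patches)
    new_jumbo = merge_jumbo(tokens[:len(pieces)])
    new_patches = tokens[len(pieces):]
    # Patch tokens use the ordinary FFN; the single jumbo token uses a wider one.
    return [patch_ffn(p) for p in new_patches], jumbo_ffn(new_jumbo)


def ffn_params(width, expansion=4):
    """Weights plus biases of a two-layer MLP: width -> expansion*width -> width."""
    hidden = expansion * width
    return width * hidden + hidden + hidden * width + width


# Tiny demo with identity stand-ins, only to show the data flow.
identity = lambda x: x
demo_patches = [[1.0, 2.0], [3.0, 4.0]]   # two patch tokens of width 2
demo_jumbo = [9.0, 8.0, 7.0, 6.0]         # jumbo CLS token of width 2 * 2
out_patches, out_jumbo = jumbo_block(
    demo_patches, demo_jumbo, identity, identity, identity
)
```

With hypothetical sizes such as a patch width of 128, a jumbo multiplier of 6, and 12 layers, `ffn_params(6 * 128)` is counted once when the jumbo FFN is shared across layers rather than 12 times. This is the sense in which sharing keeps the parameter count controlled.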

