CV | Nov 11, 2022

Token Transformer: Can class token help window-based transformer build better long-range interactions?

arXiv:2211.06083v2 | h-index: 9
Originality: Incremental advance
AI Analysis

This work targets a specific bottleneck in window-based vision transformers, namely their limited long-range modeling, and offers an incremental improvement for researchers and practitioners who use them.

The authors tackle the limited long-range modeling of window-based transformers by introducing the Token Transformer (TT), which adds a Class (CLS) token to each local window to summarize that window's information. The model achieves competitive results in image classification and downstream tasks with a low parameter count.

Compared with the vanilla transformer, the window-based transformer offers a better trade-off between accuracy and efficiency. Although the window-based transformer has made great progress, its long-range modeling capabilities are limited by the size of the local window and the window connection scheme. To address this problem, we propose a novel Token Transformer (TT). The core mechanism of TT is the addition of a Class (CLS) token to each local window to summarize that window's information. These CLS tokens interact spatially with the tokens in each window to enable long-range modeling; we refer to this type of token interaction as CLS Attention. To preserve the hierarchical design of the window-based transformer, we design a Feature Inheritance Module (FIM) in each phase of TT to deliver the local window information from the previous phase to the CLS tokens in the next phase. In addition, we design a Spatial-Channel Feedforward Network (SCFFN) in TT, which mixes CLS tokens and embedded tokens in the spatial and channel domains without additional parameters. Extensive experiments show that TT achieves competitive results with few parameters in image classification and downstream tasks.
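The idea of a per-window summary token can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the CLS token is computed or exactly which tokens attend to which, so the code below assumes (1) each window's CLS token is the mean of its tokens, (2) every token attends to its own window plus the CLS tokens of all windows, and (3) attention uses no learned projections. All function names are hypothetical.

```python
import math


def partition_windows(feature_map, window):
    """Split an H x W grid of token vectors into non-overlapping windows.

    Assumes H and W are both divisible by `window`.
    """
    height, width = len(feature_map), len(feature_map[0])
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                feature_map[row][col]
                for row in range(top, top + window)
                for col in range(left, left + window)
            ])
    return windows


def mean_token(tokens):
    """Average a list of token vectors (assumed stand-in for a CLS summary)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query, without learned projections."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cls_attention_sketch(feature_map, window):
    """Return (cls_tokens, outputs) for one simplified attention layer.

    Each window token attends to the tokens of its own window plus the
    CLS tokens of every window, so information from distant windows
    reaches it only through those summaries (an assumed reading of the
    abstract, not the paper's exact formulation).
    """
    windows = partition_windows(feature_map, window)
    cls_tokens = [mean_token(tokens) for tokens in windows]
    outputs = []
    for tokens in windows:
        context = tokens + cls_tokens  # local tokens + global summaries
        outputs.append([attend(query, context, context) for query in tokens])
    return cls_tokens, outputs
```

With a 4 x 4 map split into 2 x 2 windows, changing a token in the bottom-right window changes the outputs of the top-left window only through the CLS tokens. Under these assumptions, that is the long-range path that plain window attention lacks.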

