CV · Jan 4, 2022

PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

arXiv:2201.00978v1 · 22 citations · Has Code
Originality: Incremental advance
AI Analysis

This work provides an incremental improvement for computer vision researchers and practitioners by offering a more effective baseline for vision transformer architectures.

The paper aims to improve vision transformer performance by introducing PyramidTNT, which adds a pyramid architecture and a convolutional stem to the Transformer-in-Transformer (TNT) design to build hierarchical representations. It reports better results than previous state-of-the-art models such as Swin Transformer.

Transformer networks have achieved great progress in computer vision tasks. The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations. In this work, we present new TNT baselines by introducing two advanced designs: 1) pyramid architecture, and 2) convolutional stem. The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations. PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as Swin Transformer. We hope this new baseline will be helpful for further research on and application of vision transformers. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
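To make the idea of "hierarchical representations" concrete, the sketch below tracks how a pyramid backbone trades spatial resolution for channel width from stage to stage. It is only shape bookkeeping, not the PyramidTNT implementation: the function name `pyramid_stages`, the stem stride of 4, the four stages, the 2x downsampling between stages, and the base width of 64 are all illustrative assumptions, not values taken from the paper.

```python
def pyramid_stages(image_size=224, stem_stride=4, num_stages=4, base_dim=64):
    """Return (grid_side, num_tokens, channels) per stage of a hypothetical pyramid.

    All defaults are illustrative assumptions, not PyramidTNT's configuration.
    The stem reduces the image by `stem_stride`; each later stage halves the
    token grid and doubles the channel width.
    """
    total_stride = stem_stride * 2 ** (num_stages - 1)
    if image_size % total_stride != 0:
        raise ValueError(
            f"image_size={image_size} is not divisible by total stride {total_stride}"
        )

    stages = []
    side = image_size // stem_stride  # token grid side after the stem
    dim = base_dim
    for _ in range(num_stages):
        stages.append((side, side * side, dim))
        side //= 2  # assumed 2x spatial downsampling between stages
        dim *= 2    # assumed channel doubling between stages
    return stages


if __name__ == "__main__":
    for index, (side, tokens, dim) in enumerate(pyramid_stages(), start=1):
        print(f"stage {index}: {side}x{side} grid, {tokens} tokens, {dim} channels")
```

Under these assumed defaults the token grid shrinks from 56x56 in the first stage to 7x7 in the last, while the channel width grows from 64 to 512. A non-pyramid (columnar) transformer would instead keep one grid size and one width throughout.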

Code Implementations: 1 repo
