CV | LG | Jul 21, 2024

Efficient Visual Transformer by Learnable Token Merging

arXiv:2407.15219v212.813 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the computational bottleneck of visual transformers in computer vision applications, offering an incremental improvement in efficiency.

The paper tackles the inefficiency of visual transformers by proposing LTM-Transformer, a transformer block that merges tokens in a learnable way to reduce FLOPs and inference time. On backbones such as MobileViT and Swin, it reports accuracy comparable to or better than that of the original networks.

Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT, and Swin, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by the reduction of the Information Bottleneck (IB) loss, and a novel and separable variational upper bound for the IB loss is derived. The architecture of the mask module in our LTM blocks, which generates the token merging mask, is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks show that LTM-Transformer yields compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at https://github.com/Statistical-Deep-Learning/LTM
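To make the idea of a learnable token-merging mask concrete, here is a minimal NumPy sketch. It is not the LTM-Transformer mask module from the paper: the linear scoring weights `w_mask`, the softmax normalization, the token counts, and the rough FLOPs formula are all assumptions chosen for illustration. It only shows how a mask computed from learnable weights can merge N tokens into fewer tokens, and why that lowers the quadratic cost of self-attention.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def merge_tokens(x, w_mask):
    """Merge tokens with a soft mask (illustrative, not the paper's module).

    x:      (n_tokens, dim) input tokens
    w_mask: (dim, n_merged) hypothetical learnable weights
    """
    logits = x @ w_mask                 # (n_tokens, n_merged)
    # Normalize over input tokens so each merged token is a convex
    # combination of the original tokens.
    mask = softmax(logits, axis=0).T    # (n_merged, n_tokens)
    return mask @ x, mask               # merged tokens: (n_merged, dim)

def attention_flops(n_tokens, dim):
    # Rough multiply-add count for QK^T plus the attention-weighted sum of V.
    return 2 * n_tokens * n_tokens * dim

rng = np.random.default_rng(0)
x = rng.normal(size=(196, 64))          # e.g. a 14x14 grid of 64-dim tokens
w_mask = rng.normal(size=(64, 49)) * 0.1
merged, mask = merge_tokens(x, w_mask)
```

Under these assumptions, merging 196 tokens into 49 cuts the attention term by a factor of (196/49)^2 = 16, since that term grows with the square of the token count. In a trained network, `w_mask` would be learned jointly with the rest of the model rather than drawn at random.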

