CV · Jan 10, 2023

Dynamic Grained Encoder for Vision Transformers

arXiv:2301.03831v1 · 41 citations · h-index: 38 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the high computational cost of vision transformers in computer vision tasks and offers an incremental improvement in efficiency.

The paper tackles computational inefficiency in vision transformers by introducing a Dynamic Grained Encoder that adaptively assigns queries to spatial regions, reducing complexity by 40%-60% while maintaining comparable performance on image classification.

Transformers, the de facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. It thus achieves a fine-grained representation in discriminative regions while keeping high efficiency. Moreover, the Dynamic Grained Encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.
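The idea of assigning a different number of queries to each spatial region can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that each region is a square grid of tokens, that a region's queries are formed by average-pooling its tokens with a window of side `g`, and that the per-region granularities are chosen by hand. The abstract does not say how they are selected. The names `count_queries` and `pool_region` are hypothetical.

```python
def count_queries(granularities, region_size):
    """Total queries when region i is pooled with a g x g window, g = granularities[i]."""
    total = 0
    for g in granularities:
        if region_size % g != 0:
            raise ValueError("granularity must divide region_size")
        # A region of region_size x region_size tokens yields (region_size / g)^2 queries.
        total += (region_size // g) ** 2
    return total


def pool_region(region, g):
    """Average-pool a square region (list of rows of floats) with a g x g window."""
    size = len(region)
    pooled = []
    for r in range(0, size, g):
        row = []
        for c in range(0, size, g):
            window = [region[r + dr][c + dc] for dr in range(g) for dc in range(g)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Assumed toy setup: 4 regions of 4x4 tokens each (illustrative numbers only).
region_size = 4
choices = [1, 2, 4, 4]  # hand-picked: fine-grained in region 0, coarse elsewhere

dense_queries = count_queries([1] * len(choices), region_size)
sparse_queries = count_queries(choices, region_size)
query_fraction = sparse_queries / dense_queries

# One coarse query summarising a 2x2 region of scalar "tokens".
coarse = pool_region([[1.0, 2.0], [3.0, 4.0]], 2)
```

In this toy setup the dense encoder would use 64 queries and the sparse one 22. The ratio only counts queries and should not be read as the 40%-60% complexity reduction reported in the abstract, which is measured on full models.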

Code Implementations: 1 repo
