CV · Nov 23, 2024

Efficient Online Inference of Vision Transformers by Training-Free Tokenization

arXiv:2411.15397v3 · h-index: 17 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses a barrier to industrial adoption of vision transformers, their deployment cost, by enabling more efficient real-time inference with minimal loss in performance.

The paper tackles the high deployment cost of vision transformers for online inference by introducing a training-free tokenization method that groups frequent image patches, achieving up to 25% reduction in wattage with at most a 20% increase in runtime.
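To see how the power and runtime figures trade off, it helps to combine them into energy per inference, which is average power multiplied by time. The sketch below is illustrative arithmetic only: the 25% and 20% figures are reported as bounds and need not occur together, and the second case assumes a baseline method with the same 25% power saving and a 100% runtime increase. The function name `relative_energy` is hypothetical.

```python
def relative_energy(power_ratio: float, runtime_ratio: float) -> float:
    """Energy per inference relative to the uncompressed model.

    Energy = average power x time, so the two ratios multiply.
    """
    return power_ratio * runtime_ratio


# Assumed worst case for the method: 25% less power, 20% more runtime.
vwt_like = relative_energy(0.75, 1.20)

# Assumed comparison: same power saving, but runtime doubled (+100%).
slow_baseline = relative_energy(0.75, 2.00)

print(f"{vwt_like:.2f} {slow_baseline:.2f}")  # prints "0.90 1.50"
```

Under these assumptions, a power saving still yields a net energy saving only if the runtime penalty stays small, which is why the runtime overhead matters for online inference.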

The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to runtime, making them ill-suited for online (real-time) inference, where a prediction is made on any new input as it comes in. We introduce the $\textbf{Visual Word Tokenizer}$ (VWT), a training-free method for reducing power costs while retaining performance and runtime. The VWT groups visual subwords (image patches) that are frequently used into visual words while infrequent ones remain intact. To do so, $\textit{intra}$-image or $\textit{inter}$-image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in wattage of up to 25% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar power efficiency but exact a higher toll on runtime (up to 100% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.

Code Implementations: 1 repo
