CV | AI | Oct 6, 2023

Sub-token ViT Embedding via Stochastic Resonance Transformers

arXiv:2310.03967v2 | 8 citations | h-index: 19
Originality: Incremental advance
AI Analysis

The paper targets fine-grained inference tasks in computer vision by mitigating the spatial coarseness of ViT representations. The advance is incremental because it builds on existing ViT architectures rather than proposing a new one.

The paper tackles the problem of coarse spatial quantization in Vision Transformers (ViTs) by introducing a training-free method called Stochastic Resonance Transformers (SRT) that performs sub-token spatial transformations and aggregation, resulting in performance boosts of up to 14.9% on tasks like segmentation and classification without fine-tuning.

Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. In order to retrieve spatial details beneficial to fine-grained inference tasks we propose a training-free method inspired by "stochastic resonance". Specifically, we perform sub-token spatial transformations to the input data, and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization. SRT is applicable across any layer of any ViT architecture, consistently boosting performance on several tasks including segmentation, classification, depth estimation, and others by up to 14.9% without the need for any fine-tuning.
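The transform, encode, invert, aggregate loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `patch_features` is a hypothetical toy stand-in for a ViT encoder (it returns one mean value per non-overlapping patch), `srt_like_ensemble` is an illustrative name, and the choice of circular integer-pixel shifts, nearest-neighbour upsampling, and plain averaging are all assumptions made to keep the example self-contained.

```python
import numpy as np


def patch_features(image, patch=4):
    """Toy stand-in for a ViT encoder (assumption, not a real ViT):
    returns one feature, the mean intensity, per non-overlapping patch."""
    h, w = image.shape
    return image.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))


def srt_like_ensemble(image, encoder, patch=4, max_shift=2):
    """Average token features over sub-token shifts of the input.

    Each shift is smaller than the patch size. Features from the shifted
    input are upsampled to the pixel grid, shifted back (the inverse
    transformation), and averaged with the others.
    """
    h, w = image.shape
    acc = np.zeros((h, w), dtype=float)
    count = 0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Forward transformation: circular shift (an assumption of this sketch).
            shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
            tokens = encoder(shifted, patch)  # coarse map, shape (h/patch, w/patch)
            # Nearest-neighbour upsampling of each token to its patch footprint.
            dense = np.kron(tokens, np.ones((patch, patch)))
            # Inverse transformation: undo the shift before aggregating.
            acc += np.roll(dense, shift=(-dy, -dx), axis=(0, 1))
            count += 1
    return acc / count


# Usage: a vertical edge that does not fall on the patch grid.
img = np.zeros((16, 16))
img[:, 6:] = 1.0
coarse = np.kron(patch_features(img), np.ones((4, 4)))  # single-pass baseline
fine = srt_like_ensemble(img, patch_features)           # shifted ensemble
```

In this toy setting the single-pass map `coarse` can only change value at patch boundaries, while `fine` varies within a patch because each shift places the patch grid at a different offset relative to the edge. With a real ViT the encoder output would be a high-dimensional feature map rather than a scalar per patch, and the set of transformations and the aggregation rule would follow the paper rather than this sketch.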

Code Implementations: 1 repo