CV · Sep 24, 2025

Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

arXiv:2509.19687v1 · 4 citations · h-index: 2 · ISBDAI

Originality: Incremental advance

AI Analysis

This paper addresses structured noise artifacts in Vision Transformer feature maps for computer vision applications, but the contribution appears incremental because it builds on existing ViT frameworks.

The paper tackles feature map anomalies in Vision Transformers that hinder segmentation and depth estimation. It proposes two optimization techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), and reports improved visual quality and task performance on the ImageNet, ADE20K, and NYUv2 benchmarks.

Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artefacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimisation techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artefacts. STA enhances token diversity through spatial perturbations during tokenisation, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and are evaluated across standard benchmarks, including ImageNet, ADE20K, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.
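The abstract names the two ideas but does not say how they are implemented, so the following is only a toy illustration of what "spatial perturbations during tokenisation" and "inline denoising between layers" could look like, not the authors' method. Everything here is an assumption made for illustration: the function names `jittered_patches` and `gated_neighbour_filter`, the random patch-origin shift, the 4-neighbour averaging, and the scalar gate `alpha` (which would be a learned parameter in a real model).

```python
import math
import random


def jittered_patches(image, patch, max_shift, rng):
    """Cut a 2D image (list of rows) into flattened patch x patch tokens.

    Hypothetical stand-in for STA: each patch origin is shifted by a random
    offset in [-max_shift, max_shift], clamped so the patch stays in the image.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            dy = rng.randint(-max_shift, max_shift)
            dx = rng.randint(-max_shift, max_shift)
            y = min(max(top + dy, 0), h - patch)
            x = min(max(left + dx, 0), w - patch)
            tokens.append(
                [image[y + i][x + j] for i in range(patch) for j in range(patch)]
            )
    return tokens


def gated_neighbour_filter(tokens, grid_h, grid_w, alpha):
    """Blend each token with the mean of its 4-connected grid neighbours.

    Hypothetical stand-in for ANF: gate = sigmoid(alpha), where alpha plays
    the role of a learnable scalar. A gate near 0 leaves tokens unchanged.
    """
    gate = 1.0 / (1.0 + math.exp(-alpha))
    dim = len(tokens[0])
    out = []
    for r in range(grid_h):
        for c in range(grid_w):
            centre = tokens[r * grid_w + c]
            nbrs = [
                tokens[rr * grid_w + cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < grid_h and 0 <= cc < grid_w
            ]
            if nbrs:
                mean = [sum(n[k] for n in nbrs) / len(nbrs) for k in range(dim)]
            else:
                mean = centre  # a 1x1 grid has no neighbours to average
            out.append(
                [(1.0 - gate) * centre[k] + gate * mean[k] for k in range(dim)]
            )
    return out


# Usage: an 8x8 ramp image, 4x4 patches, so a 2x2 grid of 16-dim tokens.
rng = random.Random(0)
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
tokens = jittered_patches(image, patch=4, max_shift=1, rng=rng)
filtered = gated_neighbour_filter(tokens, grid_h=2, grid_w=2, alpha=-2.0)
```

In this sketch the perturbation would only be applied at training time, and the filter would sit between two transformer blocks; both choices are guesses based on the wording of the abstract rather than details taken from the paper.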
