CV | Sep 24, 2025

Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

arXiv:2509.19687v1 | 4 citations | h-index: 2 | ISBDAI
Originality: Incremental advance
AI Analysis

This paper addresses structured noise artifacts in Vision Transformer feature maps, but it appears incremental because it builds on existing ViT frameworks.

The paper tackles feature map anomalies in Vision Transformers that hinder segmentation and depth estimation. It proposes two optimization techniques, STA and ANF, and reports improved visual quality and task performance on benchmarks including ImageNet, ADE20K, and NYUv2.

Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artefacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimisation techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artefacts. STA enhances token diversity through spatial perturbations during tokenisation, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and are evaluated on standard benchmarks, including ImageNet, ADE20K, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.
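
The abstract gives only a one-line description of each technique, so the following is a rough illustrative sketch rather than the authors' implementation. It assumes that "spatial perturbations during tokenisation" can be approximated by randomly shifting each patch origin by a few pixels, and that "inline denoising" can be approximated by blending each token with the mean of its 3x3 neighbourhood on the token grid. The function names `jittered_patches` and `neighbourhood_filter`, and the parameters `max_shift` and `alpha`, are hypothetical; in a trained model `alpha` would be a learned parameter rather than a fixed float.

```python
import random


def jittered_patches(image, patch, max_shift, rng):
    """Split a 2D image (list of rows) into a grid of flattened patch tokens.

    Assumed stand-in for STA: each patch origin is shifted by a random
    offset of at most `max_shift` pixels, clamped so the patch stays
    inside the image.
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height - patch + 1, patch):
        row_tokens = []
        for left in range(0, width - patch + 1, patch):
            dy = rng.randint(-max_shift, max_shift)
            dx = rng.randint(-max_shift, max_shift)
            y = min(max(top + dy, 0), height - patch)
            x = min(max(left + dx, 0), width - patch)
            token = [image[y + i][x + j]
                     for i in range(patch) for j in range(patch)]
            row_tokens.append(token)
        grid.append(row_tokens)
    return grid


def neighbourhood_filter(grid, alpha):
    """Blend every token with the mean of its 3x3 grid neighbourhood.

    Assumed stand-in for ANF: alpha = 0 leaves tokens unchanged and
    alpha = 1 replaces each token with its neighbourhood mean.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect the token itself plus its in-bounds neighbours.
            neighbours = [grid[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, rows))
                          for cc in range(max(c - 1, 0), min(c + 2, cols))]
            mean = [sum(t[k] for t in neighbours) / len(neighbours)
                    for k in range(dim)]
            out_row.append([(1 - alpha) * grid[r][c][k] + alpha * mean[k]
                            for k in range(dim)])
        out.append(out_row)
    return out
```

Under these assumptions, the first function would be applied once at the input and the second between transformer blocks, for example `neighbourhood_filter(jittered_patches(image, 4, 1, random.Random(0)), 0.5)` on a toy image. A real ViT pipeline would operate on batched tensors and follow the patch step with a learned linear projection.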

