CV | AI | Feb 5

CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion

arXiv:2602.05598v1
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in Vision Transformers for computer vision tasks, offering a domain-specific improvement with competitive gains.

The paper tackles static channel-wise mixing in Vision Transformers by introducing CAViT, a dual-attention architecture that dynamically recalibrates features. Across five benchmark datasets, it reports accuracy gains of up to 3.6% over the standard ViT baseline and a reduction of over 30% in parameters and FLOPs.

Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
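The block structure described in the abstract (spatial self-attention followed by channel-wise self-attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no details on heads, normalization, projections, or scaling, so the sketch assumes single-head attention, pre-normalization, residual connections, and random projection matrices. All function and variable names (`spatial_attention`, `channel_attention`, `dual_attention_block`, `make_params`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel dimension (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def spatial_attention(x, wq, wk, wv):
    # x: (N, C). Tokens attend to tokens, so the weights are (N, N).
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


def channel_attention(x, wq, wk, wv):
    # Channels play the role of tokens, so the weights are (C, C).
    # Assumption: scores are scaled by sqrt(N), the length of each channel vector.
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q.T @ k / np.sqrt(x.shape[0]), axis=-1)
    # Output channel i is a weighted mix of the value channels j.
    return v @ weights.T, weights


def dual_attention_block(x, params):
    # Assumed layout: pre-norm, residual around each attention step,
    # with channel attention standing in for the usual MLP.
    x = x + spatial_attention(layer_norm(x), *params["spatial"])[0]
    x = x + channel_attention(layer_norm(x), *params["channel"])[0]
    return x


def make_params(rng, channels):
    # Random (C, C) query/key/value projections, for illustration only.
    def qkv():
        return tuple(
            rng.standard_normal((channels, channels)) / np.sqrt(channels)
            for _ in range(3)
        )
    return {"spatial": qkv(), "channel": qkv()}


rng = np.random.default_rng(0)
num_tokens, channels = 16, 8
x = rng.standard_normal((num_tokens, channels))
params = make_params(rng, channels)

out = dual_attention_block(x, params)
_, spatial_w = spatial_attention(layer_norm(x), *params["spatial"])
_, channel_w = channel_attention(layer_norm(x), *params["channel"])
print(out.shape, spatial_w.shape, channel_w.shape)
```

In this sketch the channel-mixing weights are recomputed from each input, which is the content-dependent behaviour the abstract contrasts with a fixed MLP. The output keeps the input shape, so such blocks could be stacked like standard Transformer blocks.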

