CV | Oct 26, 2025

Alias-Free ViT: Fractional Shift Invariance via Linear Attention

arXiv:2510.22673v1 | 2 citations | h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the sensitivity of Vision Transformers to image translations, offering a domain-specific improvement that is incremental over existing anti-aliasing methods.

The paper tackles the lack of translation invariance in Vision Transformers by proposing an Alias-Free ViT that combines alias-free downsampling and nonlinearities with linear cross-covariance attention, improving robustness to adversarial translations while maintaining competitive image classification performance.

Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets' translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
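
The equivariance claim in the abstract can be illustrated with a small numerical check. The sketch below is not the paper's implementation: it assumes a 1D circular token sequence with an odd number of tokens, a fractional shift implemented as an FFT phase ramp, and a softmax-free cross-covariance attention of the form V (KᵀQ) / N. The names `fractional_shift`, `linear_xca`, and the weight matrices are hypothetical and chosen only for illustration.

```python
import numpy as np


def fractional_shift(x, delta):
    """Circularly shift an (N, d) signal by `delta` samples along axis 0.

    Uses an FFT phase ramp, so `delta` may be non-integer. With odd N the
    ramp is Hermitian-symmetric and the shifted signal stays real.
    """
    n = x.shape[0]
    freqs = np.fft.fftfreq(n)  # frequencies in cycles per sample
    phase = np.exp(-2j * np.pi * freqs * delta)[:, None]
    return np.fft.ifft(np.fft.fft(x, axis=0) * phase, axis=0).real


def linear_xca(x, w_q, w_k, w_v):
    """Softmax-free cross-covariance attention (illustrative form only).

    The (d, d) matrix K^T Q sums over tokens, so it does not change when
    all tokens are shifted by the same orthogonal shift operator.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = k.T @ q / x.shape[0]  # channel-by-channel cross-covariance
    return v @ attn


rng = np.random.default_rng(0)
n_tokens, dim = 15, 4  # odd token count avoids the Nyquist-bin ambiguity
x = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
delta = 0.37  # a fractional shift, in units of tokens

# Equivariance: attending then shifting matches shifting then attending.
attend_then_shift = fractional_shift(linear_xca(x, w_q, w_k, w_v), delta)
shift_then_attend = linear_xca(fractional_shift(x, delta), w_q, w_k, w_v)
equivariant = np.allclose(attend_then_shift, shift_then_attend)

# Invariance: mean pooling over tokens removes the shift entirely.
pooled_original = linear_xca(x, w_q, w_k, w_v).mean(axis=0)
pooled_shifted = shift_then_attend.mean(axis=0)
invariant = np.allclose(pooled_original, pooled_shifted)

print("equivariant:", equivariant, "| pooled invariant:", invariant)
```

In this toy setting the fractional shift is a real orthogonal operator on the token axis, which is why KᵀQ is unchanged and the output simply shifts with the input. Aliasing from downsampling or pointwise nonlinearities, which the paper treats separately, is outside the scope of this sketch.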

