CV, Aug 26, 2025

Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

arXiv:2508.19167v1

Originality: Highly original

AI Analysis

This paper addresses a fundamental limitation of Vision Transformers, the loss of two-dimensional spatial structure when image patches are flattened into a sequence, and offers a geometrically principled solution with incremental improvements over existing methods.

The paper tackles positional encoding in Vision Transformers, where learnable one-dimensional embeddings over flattened patches disrupt the 2D spatial structure of images. It proposes Weierstrass Elliptic Function Positional Encoding (WEF-PE) and reports 63.78% accuracy on CIFAR-100 when training ViT-Tiny from scratch and 93.28% on CIFAR-100 when fine-tuning ViT-Base.

Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
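
To make the idea in the abstract concrete, the following is a minimal sketch of how a patch grid could be mapped to complex coordinates and passed through the Weierstrass ℘ function. It is not the authors' implementation. The function names (`weierstrass_p`, `patch_coordinates`, `wef_features`), the square lattice with periods 1 and i, the truncated lattice sum, and the choice of real and imaginary parts as a 2-channel feature are all assumptions made for illustration; the paper may use different periods, normalization, and projection layers.

```python
import numpy as np

def weierstrass_p(z, omega1=1.0 + 0.0j, omega2=1.0j, n_terms=30):
    """Approximate the Weierstrass P function by a truncated lattice sum.

    P(z) = 1/z^2 + sum over nonzero lattice points w of [1/(z-w)^2 - 1/w^2],
    with w = m*omega1 + n*omega2 and |m|, |n| <= n_terms.
    The periods are an assumption of this sketch (square lattice).
    """
    z = np.asarray(z, dtype=np.complex128)
    idx = np.arange(-n_terms, n_terms + 1)
    m, n = np.meshgrid(idx, idx, indexing="ij")
    omega = (m * omega1 + n * omega2).ravel()
    omega = omega[omega != 0]              # exclude the origin of the lattice
    zz = z[..., None]                      # broadcast against lattice points
    lattice_sum = np.sum(1.0 / (zz - omega) ** 2 - 1.0 / omega ** 2, axis=-1)
    return 1.0 / z ** 2 + lattice_sum

def patch_coordinates(grid_h, grid_w):
    """Map patch centres to complex numbers x + i*y inside the unit cell.

    Using centres (k + 0.5) / size keeps every patch away from the poles
    of P at the lattice points (hypothetical normalization).
    """
    ys = (np.arange(grid_h) + 0.5) / grid_h
    xs = (np.arange(grid_w) + 0.5) / grid_w
    x, y = np.meshgrid(xs, ys)             # both of shape (grid_h, grid_w)
    return x + 1j * y

def wef_features(grid_h, grid_w):
    """Return a (grid_h * grid_w, 2) array of [Re P(z), Im P(z)] per patch."""
    p = weierstrass_p(patch_coordinates(grid_h, grid_w))
    return np.stack([p.real, p.imag], axis=-1).reshape(grid_h * grid_w, 2)

if __name__ == "__main__":
    feats = wef_features(14, 14)           # 14 x 14 patches, as in a 224 px ViT with 16 px patches
    print(feats.shape)
```

In a real model these two raw channels would presumably be projected to the embedding dimension before being added to the patch tokens; that step is omitted here. The truncated sum is only approximately doubly periodic, with the error shrinking as `n_terms` grows.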
