CV, AI, LG | May 21, 2025

Octic Vision Transformers: Quicker ViTs Through Equivariance

arXiv:2505.15441v4 | h-index: 9 | Has Code
Originality: Highly original
AI Analysis

This work addresses the computational bottleneck of Vision Transformers in computer vision applications, offering a more efficient alternative without sacrificing accuracy.

The paper tackles the inefficiency of Vision Transformers by introducing Octic Vision Transformers, which exploit geometric symmetries such as 90-degree rotations and reflections. Its octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory relative to ordinary linear layers, and the resulting models match baseline accuracy on ImageNet-1K.

Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks, the computational reductions approach those of the linear layers as the embedding dimension increases. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network. Training octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K, we find that they match baseline accuracy while at the same time providing substantial efficiency gains.
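The 5.33x and 8x figures for the linear layers can be reproduced with a short counting argument. The sketch below is not taken from the paper's code. It assumes that the d feature channels form d/8 copies of the regular representation of the octic group (the dihedral group of order 8, with four 1-dimensional irreducible representations and one 2-dimensional one), that biases are ignored, and that cost is counted in multiply-accumulates. The names `octic_linear_costs` and `reduction_factors` are hypothetical.

```python
from fractions import Fraction

# Dimensions of the irreducible representations (irreps) of the dihedral
# group of order 8: four 1-dimensional irreps and one 2-dimensional irrep.
IRREP_DIMS = [1, 1, 1, 1, 2]


def octic_linear_costs(d):
    """Return (params, macs) for an equivariant d -> d linear map.

    Assumption (not stated in the abstract): the d channels are d/8 copies
    of the regular representation, so each irrep of dimension `dim`
    appears with multiplicity (d/8) * dim.
    """
    if d % 8 != 0:
        raise ValueError("d must be a multiple of 8")
    copies = d // 8
    params = 0
    macs = 0
    for dim in IRREP_DIMS:
        mult = copies * dim
        # By Schur's lemma, an equivariant map only mixes copies of the
        # same irrep, using one free (mult x mult) weight block per irrep.
        params += mult * mult
        # That block is applied once per component of the irrep.
        macs += dim * mult * mult
    return params, macs


def reduction_factors(d):
    """Return (memory_reduction, flop_reduction) versus a dense d x d layer."""
    params, macs = octic_linear_costs(d)
    dense = d * d
    return Fraction(dense, params), Fraction(dense, macs)


if __name__ == "__main__":
    memory, flops = reduction_factors(768)
    print(f"memory reduction: {memory}, FLOP reduction: {flops} = {float(flops):.2f}")
```

Under these assumptions the parameter count is d²/8 and the multiply-accumulate count is 3d²/16, so the reductions are 8 and 16/3 ≈ 5.33 for any d divisible by 8, which agrees with the figures quoted in the abstract.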

Code Implementations: 1 repo
