CV · AI · May 26

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

arXiv:2605.26636 · 81.6
Predicted impact: top 26% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners deploying high-resolution vision models, JetViT offers a practical post-training acceleration method that maintains accuracy while significantly improving inference efficiency.

JetViT introduces a post-training attention search that converts full-attention ViTs into hybrid-attention variants, achieving up to 1.79x higher throughput and up to 44.81% lower latency on high-resolution images without accuracy loss.
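As a rough sanity check on how the two headline figures relate, a latency reduction can be converted into an equivalent per-image speedup. This assumes the percentage is measured against the full-attention baseline's latency. The symbols t_base and t_JetViT are introduced here for illustration only.

```latex
\frac{t_{\text{base}}}{t_{\text{JetViT}}} = \frac{1}{1 - 0.4481} \approx 1.81
```

A 44.81% latency reduction therefore corresponds to roughly a 1.81x per-image speedup, in the same range as the 1.79x throughput gain. Both are "up to" figures and may come from different models, resolutions, or batch sizes, so they are not expected to match exactly.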

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
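The idea of replacing redundant full-attention blocks while preserving critical ones can be illustrated with a minimal greedy sketch. This is not the authors' Post-Training Attention Search: the function `greedy_attention_search`, the `toy_score` proxy, and the `max_drop` tolerance are all hypothetical names introduced here. A real search would score each candidate layout by evaluating a model that inherits weights from the base ViT.

```python
from typing import Callable, List, Sequence

Layout = List[str]  # one entry per block: "full", "linear", or "window"


def greedy_attention_search(
    num_blocks: int,
    score: Callable[[Sequence[str]], float],
    max_drop: float,
    candidates: Sequence[str] = ("linear", "window"),
) -> Layout:
    """Hypothetical greedy search: visit blocks in order and swap a
    full-attention block for the best-scoring cheaper block, but only if
    the score stays within max_drop of the all-full baseline."""
    layout: Layout = ["full"] * num_blocks
    baseline = score(layout)
    for i in range(num_blocks):
        best_kind, best_score = None, float("-inf")
        for kind in candidates:
            trial = layout[:i] + [kind] + layout[i + 1:]
            trial_score = score(trial)
            if trial_score > best_score:
                best_kind, best_score = kind, trial_score
        # Keep the block as full attention when every swap costs too much.
        if best_kind is not None and baseline - best_score <= max_drop:
            layout[i] = best_kind
    return layout


def toy_score(layout: Sequence[str]) -> float:
    """Made-up accuracy proxy (an assumption, not measured data):
    blocks 0 and 3 are treated as critical; odd blocks tolerate linear
    attention better, even blocks tolerate window attention better."""
    critical = {0, 3}
    penalty = 0.0
    for i, kind in enumerate(layout):
        if kind == "full":
            continue
        if i in critical:
            penalty += 1.0
        elif (kind == "linear") == (i % 2 == 1):
            penalty += 0.01
        else:
            penalty += 0.03
    return 100.0 - penalty


layout = greedy_attention_search(num_blocks=6, score=toy_score, max_drop=0.1)
print(layout)
```

With this toy proxy, the two "critical" blocks stay as full attention and the rest become a mix of linear and window blocks. A greedy pass is only one possible strategy; the abstract describes a three-step procedure whose details are not given here.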

