CV · May 21

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

arXiv:2605.22132 · 56.0
Predicted impact: top 56% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses inference bottlenecks of large vision models for deployment on resource-constrained devices.

The authors accelerate pretrained Vision Transformers by replacing certain attention heads with depthwise convolutions, achieving 17-20% inference speedup with minimal performance degradation on classification and segmentation tasks.

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
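
To make the idea of a "convolution-like" attention head concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the selection strategy or the layer design, so the locality score, the function names `locality_score` and `depthwise_conv_tokens`, the square patch grid, and the zero padding are all assumptions chosen for illustration. The first function measures how much of a head's attention mass falls inside a small window around each query patch; the second applies a per-channel (depthwise) filter to patch tokens arranged on that grid, which is the kind of operation that could stand in for a strongly local head.

```python
import numpy as np


def locality_score(attn, grid, k=3):
    """Mean fraction of attention mass inside a k x k window around each query.

    attn: (N, N) attention map of one head, rows sum to 1, N = grid * grid.
    Hypothetical criterion: a score near 1 suggests a convolution-like head.
    """
    r = k // 2
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    # near[i, j] is True when patch j lies in the k x k window centred on patch i
    near = (np.abs(ys[:, None] - ys[None, :]) <= r) & (
        np.abs(xs[:, None] - xs[None, :]) <= r
    )
    return float((attn * near).sum(axis=1).mean())


def depthwise_conv_tokens(tokens, kernels):
    """Depthwise k x k filter over patch tokens laid out on a square grid.

    tokens: (N, C) patch tokens (no class token), N a perfect square.
    kernels: (C, k, k), one filter per channel; zero padding keeps N fixed.
    Uses cross-correlation, as deep learning frameworks do for "convolution".
    """
    n, c = tokens.shape
    grid = int(round(n ** 0.5))
    k = kernels.shape[-1]
    r = k // 2
    x = tokens.reshape(grid, grid, c).astype(float)
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))
    out = np.zeros_like(x)
    # Accumulate one shifted copy of the grid per kernel tap; channels never mix
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + grid, dx:dx + grid, :] * kernels[:, dy, dx]
    return out.reshape(n, c)
```

Because each output channel depends only on a k x k neighbourhood of the same channel, the cost grows linearly with the number of patches, whereas an attention head compares every patch with every other. A head would only be a candidate for replacement under this sketch if its locality score exceeded some threshold, which would have to be tuned and followed by fine-tuning as the abstract describes.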
