CV, May 22, 2022

Dynamic Query Selection for Fast Visual Perceiver

arXiv:2205.10873v2, 1 citation, h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses efficiency for vision tasks, but it is incremental as it builds on the existing Perceiver architecture.

The paper tackles the problem of reducing inference time and network complexity in vision transformers by proposing a method to dynamically decrease the number of queries in Perceiver models, achieving a 30% speedup with only a 1.2% accuracy drop on ImageNet.

In recent work, Transformers have matched deep convolutional networks as vision architectures. Most work focuses on getting the best results on large-scale benchmarks, and scaling seems to be the most successful strategy: bigger models, more data, and longer training result in higher performance. However, reducing network complexity and inference time remains under-explored. The Perceiver model offers a solution to this problem: by first performing a cross-attention with a fixed number Q of latent query tokens, it bounds the complexity of the L-layer Transformer network that follows by O(LQ^2). In this work, we explore how to make Perceivers even more efficient by reducing the number of queries Q during inference while limiting the accuracy drop.
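The O(LQ^2) bound can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the paper's method: the function names and the example values (L = 8, Q = 512) are assumptions chosen for the example, and the count covers only query-key pairs in the latent self-attention layers, ignoring the cross-attention step and the per-token MLP cost.

```python
def latent_attention_pairs(num_layers: int, num_queries: int) -> int:
    """Count query-key pairs scored by L self-attention layers over Q latents.

    Each layer compares every latent token with every other one (Q * Q pairs),
    so the stack as a whole scales as L * Q^2.
    """
    return num_layers * num_queries ** 2


def relative_latent_cost(num_layers: int, full_queries: int, kept_queries: int) -> float:
    """Ratio of attention-pair counts after reducing Q (hypothetical helper)."""
    full = latent_attention_pairs(num_layers, full_queries)
    kept = latent_attention_pairs(num_layers, kept_queries)
    return kept / full


if __name__ == "__main__":
    # Assumed example sizes, not taken from the paper.
    print(latent_attention_pairs(8, 512))      # 2097152
    print(latent_attention_pairs(8, 256))      # 524288
    print(relative_latent_cost(8, 512, 256))   # 0.25
```

Halving Q cuts this particular term to a quarter. Measured speedups will be smaller than that ratio suggests, because the cross-attention and feed-forward costs grow only linearly in Q.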
