CV · LG · Jun 1, 2023

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Meta AI
arXiv:2306.00989v1 · 405 citations · h-index: 63 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses slow, bulky hierarchical vision transformers, offering computer vision researchers and practitioners a more efficient alternative without sacrificing accuracy. Its contribution is incremental: it simplifies existing architectures rather than introducing a new one.

The paper tackles the complexity and inefficiency of modern hierarchical vision transformers by proposing Hiera, a simplified model that removes unnecessary vision-specific components and relies on MAE pretraining to maintain accuracy. The resulting model is more accurate than previous models and significantly faster at both inference and training.

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.
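The abstract names MAE as the pretext task but does not describe it. As a rough illustration only, the sketch below shows the random patch masking step that MAE-style pretraining is generally built on: most patches are hidden and the model learns to reconstruct them from the visible ones. The function name `random_patch_mask`, the 0.75 mask ratio, and the 14x14 patch grid are assumptions made for this example; they are not taken from this page and need not match how the Hiera code selects its masks.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) sets, MAE-style.

    Hypothetical helper for illustration; mask_ratio=0.75 is an assumed
    default, not a value stated in the source.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # hidden from the encoder
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives a 14x14 grid.
visible, masked = random_patch_mask(14 * 14, mask_ratio=0.75, seed=0)
```

Because the encoder only processes the visible indices, a high mask ratio shrinks the encoder's input during pretraining, which is one reason masked pretraining can be cheap relative to supervised training on full images.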

Code Implementations: 4 repos