CV | Feb 22, 2022

HiP: Hierarchical Perceiver

Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman

arXiv:2202.10890v214.114 citations | h-index: 164

Originality: Incremental advance

AI Analysis

This work addresses the problem of scaling general perception models to larger inputs, which is relevant to both researchers and practitioners. It is an incremental advance that builds on the existing Perceiver architecture.

The paper tackles the scalability limitation of general perception systems like Perceivers by introducing locality and self-supervised learning of positional embeddings, enabling processing of raw high-resolution images and audio+video with competitive performance on datasets such as ImageNet and AudioSet.

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum, our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the exact same, unchanged model and without specialized preprocessing or any tokenization.
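
To give a rough sense of why locality helps efficiency, the sketch below counts the query-key pairs scored by attention over all inputs versus attention restricted to equal-sized groups. It is a simplified illustration, not the HiP architecture: it assumes plain self-attention rather than the latent cross-attention Perceivers use, and the function names, input count, and group count are hypothetical.

```python
def global_attention_pairs(num_inputs: int) -> int:
    """Query-key pairs scored when every input attends to every input."""
    return num_inputs * num_inputs


def local_attention_pairs(num_inputs: int, num_groups: int) -> int:
    """Query-key pairs scored when attention stays inside equal-sized groups."""
    if num_groups <= 0 or num_inputs % num_groups != 0:
        raise ValueError("num_inputs must divide evenly into num_groups")
    group_size = num_inputs // num_groups
    return num_groups * group_size * group_size


# Hypothetical sizes chosen only for illustration (not taken from the paper).
n = 1_048_576  # about one million inputs, e.g. a 1024 x 1024 grid
g = 256        # number of local groups

savings = global_attention_pairs(n) // local_attention_pairs(n, g)
print(savings)  # 256: under these assumptions the saving equals the group count
```

Under these assumptions the cost drops by a factor equal to the number of groups, which is why restricting attention to local groups makes much larger inputs tractable.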

