Container: Context Aggregation Network
This work addresses the need for efficient, scalable vision architectures that combine the strengths of CNNs and Transformers. It reports large gains on downstream detection and segmentation tasks, although the approach builds incrementally on existing methods.
The paper unifies seemingly disparate computer vision architectures (CNNs, Transformers, and MLP-Mixers) by proposing Container, a general-purpose building block for multi-head context aggregation that exploits long-range interactions while retaining the inductive bias of local convolution. Compared with a ResNet-50 backbone, it improves detection mAP by 6.6 to 7.3 points and mask mAP by 6.6 points on object detection and instance segmentation.
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers -- originally introduced in natural language processing -- have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions \emph{a la} Transformers while retaining the inductive bias of the local convolution operation, which leads to the faster convergence often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN. It obtains detection mAP of 38.9, 43.8 and 45.1 and mask mAP of 41.3, improvements of 6.6, 7.3, 6.9 and 6.6 points respectively over a ResNet-50 backbone with comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework. Code is released at \url{https://github.com/allenai/container}.
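The unified view described above can be illustrated with a small single-head sketch in NumPy. It assumes, since this section does not state a formula, that context aggregation can be written as an affinity matrix applied to value vectors, where the affinity is either input-dependent (as in self-attention) or input-independent (as in convolution or MLP-style token mixing). The function names, the mixing weights `alpha` and `beta`, and the 1-D token layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_affinity(x, w_q, w_k):
    # Input-dependent affinity, as in self-attention: softmax(Q K^T / sqrt(d)).
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def local_static_affinity(n, kernel):
    # Input-independent affinity with a shared local kernel: a zero-padded
    # 1-D convolution over n tokens, written as an n x n banded matrix.
    # (A dense learned n x n matrix here would instead resemble
    # MLP-Mixer-style token mixing.)
    a = np.zeros((n, n))
    r = len(kernel) // 2
    for i in range(n):
        for offset, w in enumerate(kernel):
            j = i + offset - r
            if 0 <= j < n:
                a[i, j] = w
    return a

def aggregate(x, w_v, a_dynamic, a_static, alpha, beta):
    # Hypothetical mixing form: Y = (alpha * A_dynamic + beta * A_static) V.
    v = x @ w_v
    return (alpha * a_dynamic + beta * a_static) @ v

rng = np.random.default_rng(0)
n, d = 6, 4                      # 6 tokens, 4 channels (toy sizes)
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

a_dyn = dynamic_affinity(x, w_q, w_k)
a_loc = local_static_affinity(n, kernel=[0.25, 0.5, 0.25])

y_attention = aggregate(x, w_v, a_dyn, a_loc, alpha=1.0, beta=0.0)  # attention only
y_conv = aggregate(x, w_v, a_dyn, a_loc, alpha=0.0, beta=1.0)       # local conv only
y_mixed = aggregate(x, w_v, a_dyn, a_loc, alpha=1.0, beta=1.0)      # both combined
```

Setting `beta=0` recovers plain attention-style aggregation and setting `alpha=0` recovers a per-channel local convolution, so under these assumptions the two behave as special cases of one operation. A multi-head version would apply the same idea to separate channel groups.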