CV | Nov 22, 2021

Class-agnostic Object Detection with Multi-modal Transformer

arXiv:2111.11430v6 | 125 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of detecting generic objects without predefined categories, which is crucial for applications such as open-world detection. The advance is incremental because it builds on existing multi-modal transformers.

The paper tackles the problem of class-agnostic object detection by proposing a multi-modal transformer trained with aligned image-text pairs, achieving state-of-the-art performance across various domains and novel objects.

What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY
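To make "class-agnostic" concrete, here is a minimal sketch of how box proposals might be scored when category labels are ignored: a ground-truth box counts as found if any proposal overlaps it enough. This is not the paper's evaluation code. The function names (`iou`, `class_agnostic_recall`), the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(
    proposals: List[Box],
    ground_truth: List[Box],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Fraction of ground-truth boxes matched by at least one proposal.

    Category labels are never consulted: only box overlap matters.
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for gt in ground_truth
        if any(iou(p, gt) >= iou_threshold for p in proposals)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    proposals = [(0, 0, 10, 10), (20, 20, 30, 30)]
    ground_truth = [(0, 0, 10, 10), (50, 50, 60, 60)]
    print(class_agnostic_recall(proposals, ground_truth))
```

Because the metric never looks at labels, it applies equally to objects from categories the detector was never trained on, which is the setting the abstract targets.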

Code Implementations: 1 repo
