CV | Mar 31, 2022

FindIt: Generalized Localization with Natural Language Queries

arXiv:2203.17273v2 | 18 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for versatile and efficient computer vision models by offering a unified solution for localization tasks, which could benefit applications such as robotics and image analysis. The advance appears incremental, since it builds on existing object detection techniques.

The authors tackle the problem of unifying visual grounding and localization tasks, such as referring expression comprehension and object detection. They propose FindIt, a simple framework that outperforms state-of-the-art methods on referring expression comprehension and text-based localization, performs competitively on object detection, and generalizes better to out-of-distribution data than strong single-task baselines.

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared to strong single-task baselines. All of these are accomplished by a single, unified and efficient model. The code will be released.
