CV · Sep 30, 2022

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

arXiv:2209.15639v2 · 191 citations · h-index: 42
Originality: Highly original
AI Analysis

This work addresses the problem of detecting objects from arbitrary categories, offering a more efficient and scalable approach for computer vision applications.

The paper tackles open-vocabulary object detection by proposing F-VLM, a method that builds on frozen vision and language models to simplify training. It achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS benchmark.

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
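The abstract says the detector and VLM outputs are combined per region at inference time, but it does not state the combination rule. The sketch below shows one plausible rule: a weighted geometric mean of the two per-category probabilities, with a separate weight for base (seen in training) and novel categories. The function names, the temperature, and the weights `alpha` and `beta` are illustrative assumptions, not values taken from the text above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def vlm_region_probs(region_embedding, text_embeddings, temperature=0.01):
    """Classify one region with a frozen VLM (hypothetical helper).

    Scores the pooled region embedding against one text embedding per
    category name, then normalizes with a softmax. The temperature is an
    assumed value for illustration.
    """
    sims = [cosine(region_embedding, t) for t in text_embeddings]
    return softmax(sims, temperature)


def combine_scores(detector_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """Weighted geometric mean of detector and VLM probabilities.

    alpha weights the VLM on base categories and beta on novel ones.
    A larger beta trusts the VLM more on categories the detector head
    never saw during finetuning. Both defaults are assumptions.
    """
    combined = []
    for z, w, base in zip(detector_probs, vlm_probs, is_base):
        weight = alpha if base else beta
        combined.append(z ** (1.0 - weight) * w ** weight)
    return combined


# Toy example: two categories, the first base and the second novel.
detector_probs = [0.7, 0.3]
vlm_probs = vlm_region_probs([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]])
scores = combine_scores(detector_probs, vlm_probs, is_base=[True, False])
```

Using different weights for base and novel categories reflects the intuition that the finetuned detector head is more reliable on categories it was trained on, while the frozen VLM carries the open-vocabulary knowledge.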

Code Implementations: 1 repo
