CV AI LG · Dec 1, 2024

Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli

arXiv:2412.00622v2 · 7.65 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses modality adaptation for vision-language object detectors, which could make detection more robust across diverse visual environments. The advance is incremental because it builds on existing prompt strategies.

The paper tackles the problem of adapting vision-language object detectors to new visual modalities like infrared and depth without degrading their zero-shot capabilities, proposing ModPrompt, a visual prompt strategy that achieves performance comparable to full fine-tuning while preserving zero-shot ability on datasets such as LLVIP, FLIR, and NYUv2.

The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, we propose an encoder-decoder visual prompt strategy, further enhanced by the integration of an inference-friendly modality prompt decoupled residual, facilitating a more robust adaptation. We benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, where it achieves performance comparable to full fine-tuning while preserving the model's zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.
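
The abstract contrasts a prompt that applies the same translation to every image with an encoder-decoder prompt. A minimal sketch of that difference follows. It is a toy illustration under stated assumptions, not the ModPrompt implementation: the function names, the 2x2 average-pooling "encoder", the nearest-neighbour "decoder", and the scalar weights `w_enc` and `w_dec` are all hypothetical, and the detector is not modelled here. The decoupled residual mentioned in the abstract is not covered.

```python
# Toy contrast between an input-independent prompt and an input-dependent
# (encoder-decoder) prompt. Images are 2-D lists of floats with even height
# and width. All names and design choices are illustrative assumptions.

def add_images(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fixed_prompt(image, delta):
    """Input-independent prompt: the same learned offset for every image."""
    return add_images(image, delta)

def encode(image, w_enc):
    """Hypothetical encoder: 2x2 average pooling scaled by a learned weight."""
    h, w = len(image), len(image[0])
    return [
        [
            w_enc * (image[i][j] + image[i][j + 1]
                     + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]

def decode(code, w_dec):
    """Hypothetical decoder: 2x nearest-neighbour upsampling scaled by a weight."""
    out = []
    for row in code:
        upsampled = [w_dec * v for v in row for _ in range(2)]
        out.append(upsampled)
        out.append(list(upsampled))
    return out

def encoder_decoder_prompt(image, w_enc, w_dec):
    """Input-dependent prompt: the offset is computed from the image itself."""
    return add_images(image, decode(encode(image, w_enc), w_dec))
```

With a fixed prompt, two different images receive an identical offset, whereas the encoder-decoder variant produces an offset that depends on the image content. In both cases only the prompt parameters would be trained; the detector's weights stay untouched, which is why zero-shot behaviour can be retained.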
