CV | Mar 28, 2024

LocCa: Visual Pretraining with Location-aware Captioners

arXiv:2403.19596v2 | 31 citations | h-index: 39 | NIPS
Originality: Incremental advance
AI Analysis

This work addresses a gap in visual pretraining: it offers computer vision researchers a simple method that improves performance on localization tasks. The advance is incremental, since it builds on existing captioning approaches.

The paper tackles the problem of incorporating location-aware information into visual pretraining. It proposes LocCa, a method in which a location-aware captioner outputs bounding box coordinates and captions. LocCa significantly outperforms standard captioners on localization downstream tasks while maintaining comparable performance on holistic tasks.

Image captioning has been shown to be an effective pretraining method, similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa outperforms standard captioners significantly on localization downstream tasks while maintaining comparable performance on holistic tasks.
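To make the "captioner task interface" more concrete, here is a minimal Python sketch of how a bounding box and a caption might be serialized into a single text target for an encoder-decoder captioner. It is an illustration under stated assumptions, not the authors' implementation: the bin count `NUM_BINS`, the `<locNNNN>` token format, the `(ymin, xmin, ymax, xmax)` ordering, and all function names are hypothetical and do not come from the abstract.

```python
NUM_BINS = 1000  # assumption: quantization resolution, not stated in the source


def quantize_box(box, height, width, num_bins=NUM_BINS):
    """Map a pixel-space box (ymin, xmin, ymax, xmax) to integer bins.

    Each coordinate is normalized by the image size and clamped to
    the range [0, num_bins - 1].
    """
    ymin, xmin, ymax, xmax = box
    normalized = (ymin / height, xmin / width, ymax / height, xmax / width)
    return tuple(min(num_bins - 1, max(0, int(v * num_bins))) for v in normalized)


def box_to_text(bins):
    """Render quantized coordinates as hypothetical location tokens."""
    return " ".join(f"<loc{b:04d}>" for b in bins)


def build_target(box, caption, height, width):
    """Build one decoder target: box coordinates followed by the caption."""
    return f"{box_to_text(quantize_box(box, height, width))} {caption}"


if __name__ == "__main__":
    # A 200x400 image with one annotated region.
    print(build_target((50, 100, 150, 300), "a dog on a sofa", height=200, width=400))
```

Because both coordinates and captions end up as plain token sequences, the same decoder can in principle be trained on several such targets at once, which is the multitask property the abstract attributes to the encoder-decoder architecture.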

Code Implementations: 1 repo
