LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision
LocTex addresses the high annotation cost faced by computer vision researchers and practitioners by offering a more data-efficient pre-training approach. The contribution is incremental, since it builds on existing vision+language pre-training methods.
The paper aims to reduce annotation effort in computer vision. It proposes LocTex, which learns visual representations from low-cost localized textual annotations (captions and synchronized mouse-over gestures). Compared with ImageNet supervised pre-training, LocTex achieves comparable or improved performance on COCO instance segmentation with a 10x smaller pre-training dataset. Given the same amount of annotations, it achieves around 4% higher accuracy on PASCAL VOC image classification than the previous state-of-the-art vision+language pre-training approach.
Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex that takes advantage of the low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art "vision+language" pre-training approach on the task of PASCAL VOC image classification.
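The two training signals described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the text here does not specify the encoders, the attention formulation, or the exact loss functions, so the sketch assumes a symmetric InfoNCE-style contrastive loss over matched image and caption features, and a binary cross-entropy between an attention map and a rasterized mouse trace. All function names, the temperature of 0.1, and the grid size are hypothetical choices made for illustration, and plain Python is used in place of a deep learning framework to keep the example self-contained.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Assumed symmetric InfoNCE loss over a batch of matched (image, caption) features."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and caption j, divided by the temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text: each image should select its own caption (the diagonal entry).
    i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-image: each caption should select its own image.
    t2i = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def render_trace_mask(points, grid=4):
    """Rasterize normalized (x, y) mouse-trace points into a binary grid-by-grid mask."""
    mask = [[0.0] * grid for _ in range(grid)]
    for x, y in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        mask[row][col] = 1.0
    return mask

def attention_bce(attention, mask):
    """Mean binary cross-entropy between an attention map with values in (0, 1) and a mask."""
    eps = 1e-7
    total, count = 0.0, 0
    for att_row, mask_row in zip(attention, mask):
        for a, m in zip(att_row, mask_row):
            a = min(max(a, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -(m * math.log(a) + (1.0 - m) * math.log(1.0 - a))
            count += 1
    return total / count

# Toy usage: matched pairs should give a lower contrastive loss than mismatched ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
trace_mask = render_trace_mask([(0.1, 0.1), (0.9, 0.9)], grid=2)
```

In this sketch the contrastive term stands in for the semantic signal from free-form captions, and the attention term for the coarse localization signal from mouse traces. How the two terms are weighted and combined during pre-training is not stated in the text above, so it is left out.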