Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
This work addresses data scarcity in visual grounding, a domain-specific task, with incremental improvements in performance.
The paper tackles the problem of learning visual grounding under data-scarce settings by proposing POBF, a framework that synthesizes images and selects effective training data, achieving an average gain of 5.83% over real-data-only methods and outperforming baselines by 2.29%-3.85% in accuracy.
Visual grounding aims to localize image regions described by a textual query. Because large-scale data curation is difficult, we investigate how to learn visual grounding effectively under data-scarce settings. To address this scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting the region outside the bounding box, which addresses a label misalignment issue encountered in previous works. POBF then applies a filtering scheme to select the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments across four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29%-3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
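The idea of inpainting outside the box can be illustrated with a minimal sketch of the mask an inpainting model might receive. This is not the authors' implementation: the function name `outside_box_mask`, the `(x_min, y_min, x_max, y_max)` box layout with exclusive max edges, and the convention that 1 marks repaintable pixels are all assumptions made for illustration. The point it shows is that if the box interior is never repainted, the original box annotation should remain valid for the synthesized image.

```python
def outside_box_mask(height, width, box):
    """Return a height x width mask: 1 = pixel may be repainted, 0 = keep.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates, with the
    max edges exclusive. Both the layout and the 1/0 convention are
    assumptions for this sketch, not taken from the paper.
    """
    x_min, y_min, x_max, y_max = box
    # Clamp the box to the image so out-of-range boxes do not raise.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [
        [0 if (x_min <= x < x_max and y_min <= y < y_max) else 1
         for x in range(width)]
        for y in range(height)
    ]


# Toy example: a 4 x 6 image whose annotated box covers a 3 x 2 region.
mask = outside_box_mask(4, 6, (1, 1, 4, 3))
for row in mask:
    print("".join(str(v) for v in row))
```

In a real pipeline the mask would typically be converted to whatever image format the chosen inpainting model expects, and that model's own mask convention should be checked before use.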