SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
SynthRef addresses the high cost of annotation faced by computer vision researchers and practitioners and offers a scalable way to generate datasets, though the contribution is incremental because it builds on existing methods for synthetic data.
The paper tackles the bottleneck of expensive annotation for visual grounding tasks by proposing SynthRef, a method for generating synthetic referring expressions, and releases a large-scale dataset for video object segmentation. Experiments show that training with synthetic expressions improves model generalization across datasets without additional annotation cost.
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in annotation time, which creates a bottleneck. To address this, we propose SynthRef, a novel method for generating synthetic referring expressions for target objects in an image (or video frame), and we release the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that training with our synthetic referring expressions improves a model's ability to generalize across different datasets, without any additional annotation cost. Moreover, our formulation can be applied to any object detection or segmentation dataset.
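To make the idea of deriving a referring expression from existing detection annotations concrete, the following is a minimal sketch and not the SynthRef procedure itself, which this abstract does not detail. It assumes each object carries only a class label and a bounding box, and it assumes a simple rule set: use the class name alone when it is unique in the frame, otherwise add a left/right or size cue. The names `Obj` and `synth_expression` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Obj:
    """Hypothetical annotation record: class label plus pixel bounding box."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center_x(o: Obj) -> float:
    return (o.box[0] + o.box[2]) / 2


def area(o: Obj) -> float:
    return (o.box[2] - o.box[0]) * (o.box[3] - o.box[1])


def synth_expression(target: Obj, objects: List[Obj]) -> Optional[str]:
    """Return an expression that singles out `target`, or None if the
    assumed cues (class, left/right position, size) cannot disambiguate it."""
    # Other objects in the frame that share the target's class.
    same = [o for o in objects if o is not target and o.label == target.label]
    if not same:
        return f"the {target.label}"
    # Assumed cue 1: horizontal position relative to same-class objects.
    if all(center_x(target) < center_x(o) for o in same):
        return f"the {target.label} on the left"
    if all(center_x(target) > center_x(o) for o in same):
        return f"the {target.label} on the right"
    # Assumed cue 2: box area relative to same-class objects.
    if all(area(target) > area(o) for o in same):
        return f"the largest {target.label}"
    if all(area(target) < area(o) for o in same):
        return f"the smallest {target.label}"
    return None


frame = [
    Obj("dog", (10, 10, 50, 50)),
    Obj("dog", (100, 0, 200, 300)),
    Obj("dog", (200, 10, 300, 120)),
    Obj("cat", (100, 100, 150, 150)),
]
expressions = [synth_expression(o, frame) for o in frame]
```

Because the sketch reads only labels and boxes, it could in principle run over any detection or segmentation dataset that provides them; returning `None` for ambiguous targets keeps misleading expressions out of the generated data.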