CV · Sep 29, 2023

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

Wei Ji, Li Li, Hao Fei, Xiangyan Liu, Xun Yang, Juncheng Li, Roger Zimmermann

arXiv:2309.17205v19.812 citations h-index: 23

Originality Synthesis-oriented

AI Analysis

The paper addresses the need for more realistic RIS evaluation, though its contribution is incremental because it builds on existing datasets and methods.

The paper tackles the lack of benchmarks for complex language queries in Referring Image Segmentation (RIS) by proposing RIS-CQ, a new dataset built on RefCOCO and Visual Genome, and introduces DuMoGa, a method that outperforms existing RIS approaches.

Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms. However, little research has investigated how existing algorithms should be benchmarked with complex language queries, which include more informative descriptions of surrounding objects and backgrounds (e.g., "the black car" vs. "the black car is parking on the road and beside the bus"). Given the significant improvement in the semantic understanding capability of large pre-trained models, it is crucial to take a step further in RIS by incorporating complex language that resembles real-world applications. To close this gap, building upon the existing RefCOCO and Visual Genome datasets, we propose a new RIS benchmark with complex queries, namely RIS-CQ. The RIS-CQ dataset is of high quality and large scale; it challenges existing RIS methods with enriched, specific, and informative queries, and enables a more realistic scenario for RIS research. In addition, we present a niche-targeting method to better tackle RIS-CQ, called the dual-modality graph alignment model (DuMoGa), which outperforms a series of RIS methods.
