Advancing Referring Expression Segmentation Beyond Single Image
This work addresses a practical problem for computer vision researchers: it makes RES applicable to real-world scenarios that involve multiple images. The contribution is incremental, as it builds on existing RES frameworks.
The authors address a limitation of Referring Expression Segmentation (RES) by proposing a new setting, Group-wise Referring Expression Segmentation (GRES), which extends RES to collections of images in which the described objects may be present in only a subset. They also introduce a dataset (GRD) and a baseline method (GRSer) that achieves state-of-the-art results on GRES and related tasks.
Referring Expression Segmentation (RES) is a widely explored multi-modal task that aims to segment an object, assumed to exist in a single image, described by a given linguistic expression. In broader real-world scenarios, however, it is not always possible to know in advance whether the described object exists in a specific image. Typically, we have a collection of images, only some of which may contain the described objects. The current RES setting limits its practicality in such situations. To overcome this limitation, we propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES), which expands RES to a collection of related images and allows the described objects to be present in only a subset of the input images. To support this new setting, we introduce a carefully compiled dataset named Grouped Referring Dataset (GRD), which contains complete group-wise annotations of the target objects described by the given expressions. We also present a baseline method named Grouped Referring Segmenter (GRSer), which explicitly captures language-vision and intra-group vision-vision interactions to achieve state-of-the-art results on the proposed GRES task and on related tasks such as Co-Salient Object Detection and RES. Our dataset and code will be publicly released at https://github.com/yixuan730/group-res.
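
To make the GRES setting concrete, the following minimal Python sketch illustrates one way the input/output contract could be represented: a single expression, a group of related images, and one binary mask per image that is empty when the described object is absent. All names here (GroupSample, has_target, positive_image_ids, the image identifiers) are hypothetical and chosen for illustration; they are not taken from the released code, and the actual GRD annotation format may differ.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask: 1 marks a target pixel, 0 marks background


@dataclass
class GroupSample:
    """One hypothetical GRES sample: an expression plus a group of related images."""
    expression: str
    image_ids: List[str]  # placeholder identifiers, not real GRD file names
    masks: List[Mask]     # one mask per image; all zeros when the target is absent


def has_target(mask: Mask) -> bool:
    """Return True if at least one pixel in the mask is marked as target."""
    return any(any(row) for row in mask)


def positive_image_ids(sample: GroupSample) -> List[str]:
    """Identifiers of the images in the group that contain the described object."""
    return [img for img, m in zip(sample.image_ids, sample.masks) if has_target(m)]


# Toy group of three 2x2 "images": the described object appears in two of them.
sample = GroupSample(
    expression="the dog on the left",
    image_ids=["img_a", "img_b", "img_c"],
    masks=[
        [[1, 0], [1, 0]],  # target present
        [[0, 0], [0, 0]],  # target absent: the expected output is an empty mask
        [[0, 1], [0, 0]],  # target present
    ],
)
print(positive_image_ids(sample))  # prints ['img_a', 'img_c']
```

Under this representation, the classic RES setting corresponds to a group of exactly one image whose mask is guaranteed to be non-empty, whereas GRES drops that guarantee and requires an empty prediction for images such as img_b.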