Recurrent Instance Segmentation using Sequences of Referring Expressions
This work addresses the challenge of precise object segmentation from sequential language inputs, with applications in computer vision and human-computer interaction. It represents an incremental advance in multimodal methods.
The paper tackles the problem of segmenting objects in an image based on a sequence of linguistic descriptions, proposing a recurrent neural network that outputs binary masks for each expression. Experiments on the RefCOCO dataset show the architecture successfully leverages expression sequences for instance segmentation.
The goal of this work is to segment the objects in an image that are referred to by a sequence of linguistic descriptions (referring expressions). We propose a deep neural network whose recurrent layers output a sequence of binary masks, one for each referring expression provided by the user. The recurrent layers allow the model to condition each predicted mask on the previous ones, from a spatial perspective within the same image. Our multimodal approach uses off-the-shelf architectures to encode both the image and the referring expressions. The visual branch provides a tensor of pixel embeddings, which are concatenated with the phrase embeddings produced by a language encoder. Our experiments on the RefCOCO dataset for still images indicate that the proposed architecture successfully exploits sequences of referring expressions to solve the pixel-wise task of instance segmentation.
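The data flow described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the per-pixel recurrent unit, the random embeddings, and all names (`fuse`, `recurrent_step`, `segment_sequence`, `w_in`, `w_rec`, `w_out`) are assumptions made for illustration. In the actual model the embeddings would come from the off-the-shelf image and language encoders, and the recurrent layers would be learned. The sketch only shows the two ideas stated above: each phrase embedding is concatenated onto every pixel embedding, and a hidden state carried across expressions lets each mask depend on the previous steps.

```python
import math
import random

def fuse(pixel_emb, phrase_emb):
    """Concatenate the phrase embedding onto every pixel embedding (H x W x (Dv + Dl))."""
    return [[pix + phrase_emb for pix in row] for row in pixel_emb]

def recurrent_step(fused, hidden, w_in, w_rec, w_out):
    """One step of a toy per-pixel recurrent unit (a hypothetical stand-in
    for the recurrent layers mentioned in the abstract)."""
    new_hidden, mask = [], []
    for row_x, row_h in zip(fused, hidden):
        h_row, m_row = [], []
        for x, h_prev in zip(row_x, row_h):
            # Mix the fused input with the state left by earlier expressions.
            pre = sum(w * v for w, v in zip(w_in, x)) + w_rec * h_prev
            h = math.tanh(pre)
            prob = 1.0 / (1.0 + math.exp(-w_out * h))
            h_row.append(h)
            m_row.append(1 if prob > 0.5 else 0)  # threshold to a binary mask
        new_hidden.append(h_row)
        mask.append(m_row)
    return new_hidden, mask

def segment_sequence(pixel_emb, phrase_embs, w_in, w_rec, w_out):
    """Produce one binary mask per referring expression, in order."""
    height, width = len(pixel_emb), len(pixel_emb[0])
    hidden = [[0.0] * width for _ in range(height)]
    masks = []
    for phrase_emb in phrase_embs:
        fused = fuse(pixel_emb, phrase_emb)
        hidden, mask = recurrent_step(fused, hidden, w_in, w_rec, w_out)
        masks.append(mask)
    return masks

# Random placeholders standing in for encoder outputs and learned weights.
rng = random.Random(0)
H, W, DV, DL = 4, 5, 3, 2
pixel_emb = [[[rng.uniform(-1, 1) for _ in range(DV)] for _ in range(W)] for _ in range(H)]
phrase_embs = [[rng.uniform(-1, 1) for _ in range(DL)] for _ in range(3)]
w_in = [rng.uniform(-1, 1) for _ in range(DV + DL)]

masks = segment_sequence(pixel_emb, phrase_embs, w_in, w_rec=0.5, w_out=1.0)
print(len(masks), len(masks[0]), len(masks[0][0]))  # 3 4 5
```

Three expressions over a 4 x 5 grid of pixel embeddings yield three 4 x 5 binary masks. A convolutional recurrent layer would additionally share information between neighbouring pixels, which this per-pixel toy unit does not attempt.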