CV · CL · Oct 20, 2022

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

arXiv:2210.11109v2291 citations
h-index: 131
Originality: Incremental advance
AI Analysis

This work addresses the generation of controlled, spatially oriented descriptions for images. The advance is incremental: it builds on existing image-to-text tasks by narrowing the focus to spatial semantics.

The authors introduce Visual Spatial Description (VSD), a task for generating text descriptions focused on the spatial relationship between two objects in an image, and develop benchmark models using VL-BART and VL-T5 backbones. They show that joint end-to-end architectures incorporating visual spatial relationship classification (VSRC) improve accuracy and human-like quality.

Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we further advance this line of work by presenting Visual Spatial Description (VSD), a new perspective for image-to-text generation oriented toward spatial semantics. Given an image and two objects inside it, VSD aims to produce one description focusing on the spatial relationship between the two objects. Accordingly, we manually annotate a dataset to facilitate the investigation of the newly introduced task and build several benchmark encoder-decoder models using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our model. Finally, we conduct experiments on our benchmark dataset to evaluate all our models. Results show that our models provide accurate and human-like spatial-oriented text descriptions. Meanwhile, VSRC has great potential for VSD, and the joint end-to-end architecture is the better choice for their integration. We make the dataset and code public for research purposes.
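
To make the task setup concrete, here is a minimal sketch of the pipeline variant described in the abstract: a relation is predicted first and then handed to the description model as extra input. Everything in it is an assumption for illustration, not the authors' code: the `VSDInput` fields, the box format, the prompt wording, and `toy_vsrc`, a rule-based stand-in for a learned VSRC classifier. The joint end-to-end variant, which the abstract reports as the better choice, trains both parts together and is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class VSDInput:
    """Hypothetical container for one VSD query: an image and two objects."""
    image_id: str
    subject: str
    subject_box: Box
    obj: str
    object_box: Box


def toy_vsrc(sample: VSDInput) -> str:
    """Rule-based stand-in for a VSRC classifier (illustration only).

    Compares horizontal box centers; a real classifier would be learned
    and would cover many more relations.
    """
    subject_cx = (sample.subject_box[0] + sample.subject_box[2]) / 2
    object_cx = (sample.object_box[0] + sample.object_box[2]) / 2
    return "left of" if subject_cx < object_cx else "right of"


def build_prompt(sample: VSDInput, relation: Optional[str] = None) -> str:
    """Build the text input for an encoder-decoder description model.

    The prompt wording is an assumption. In the pipeline setting the
    predicted relation is appended as an explicit hint.
    """
    prompt = (
        f"describe the spatial relation between {sample.subject} "
        f"and {sample.obj}"
    )
    if relation is not None:
        prompt += f" [relation: {relation}]"
    return prompt


sample = VSDInput(
    image_id="example-0001",          # hypothetical identifier
    subject="dog",
    subject_box=(0.0, 0.0, 10.0, 10.0),
    obj="bench",
    object_box=(20.0, 0.0, 30.0, 10.0),
)

# Pipeline: classify the relation first, then condition generation on it.
relation = toy_vsrc(sample)
prompt = build_prompt(sample, relation)
print(prompt)
```

In a full system the prompt, together with visual features for the image and the two boxes, would be passed to a backbone such as VL-BART or VL-T5 to generate the final description.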

Code Implementations: 1 repo
