CV CL LG Jul 12, 2020

Sparse Graph to Sequence Learning for Vision Conditioned Long Textual Sequence Generation

arXiv:2007.06077v1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating coherent, story-like text from images, which is an incremental advancement over standard image captioning for applications in automated storytelling or detailed visual descriptions.

The paper tackles the problem of generating longer textual sequences from visual information by framing it as a graph-to-sequence learning problem and introducing the Sparse Graph-to-Sequence Transformer (SGST). It achieves a 13.3% improvement in CIDEr score compared to previous state-of-the-art methods on a benchmark image paragraph dataset.

Generating longer textual sequences conditioned on visual information is an interesting problem to explore. The challenge goes beyond standard vision-conditioned sentence-level generation (e.g., image or video captioning), as it requires producing a brief and coherent story describing the visual content. In this paper, we cast this Vision-to-Sequence task as a Graph-to-Sequence learning problem and approach it with the Transformer architecture. Specifically, we introduce the Sparse Graph-to-Sequence Transformer (SGST) for encoding the graph and decoding a sequence. The encoder aims to directly encode graph-level semantics, while the decoder is used to generate longer sequences. Experiments conducted on the benchmark image paragraph dataset show that our proposed approach achieves a 13.3% improvement on the CIDEr evaluation measure compared to the previous state-of-the-art approach.

