CV, CL | Feb 28, 2015

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

arXiv:1503.00064v1 | 27 citations
Originality: Incremental advance
AI Analysis

This work addresses the limitation of single-sentence descriptions in scene understanding by producing more detailed and coherent output, which could benefit applications such as robotics or accessibility.

The paper tackles the problem of generating multi-sentence lingual descriptions of complex indoor scenes and reports substantially higher ROUGE scores than the baseline on the augmented NYU-v2 dataset.

This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.
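
For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a generated description and a human-written reference. The sketch below is a minimal illustration only: the function names `ngrams` and `rouge_n`, the lowercase whitespace tokenization, and the single-reference setup are assumptions made for this example, and the abstract does not state which ROUGE variant or tooling the authors used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N recall, precision and F1 for one candidate/reference pair.

    Tokenization is naive lowercase whitespace splitting (an assumption
    made for this sketch, not the paper's preprocessing).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    generated = "a table is in the room"
    human = "there is a table in the room"
    print(rouge_n(generated, human, n=1))
    print(rouge_n(generated, human, n=2))
```

A multi-sentence description can be scored the same way by passing the whole generated paragraph and the whole reference paragraph as the two strings.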

