CV · Aug 18, 2016

Seeing with Humans: Gaze-Assisted Neural Image Captioning

arXiv:1608.05203v1 · 75 citations
Originality: Incremental advance
AI Analysis

The paper addresses scene understanding for computer vision systems, but the advance is incremental: it extends gaze applications from object-centric to scene-centric tasks.

The paper asks whether human gaze can improve image captioning by studying its interplay with neural attention mechanisms, and shows that integrating gaze information into a split attention model improves performance on the COCO/SALICON datasets.

Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
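
The split attention idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `split_attention`, the scoring vectors `w_fix` and `w_non`, and the per-region `gaze` weights are all hypothetical names chosen for illustration. The dependence of the scores on the LSTM hidden state is omitted for brevity. The sketch assumes each image region carries a gaze weight in [0, 1], scores fixated and non-fixated evidence with separate parameters, and normalizes the blended scores with a single softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_attention(features, gaze, w_fix, w_non):
    """Blend two attention scorers using per-region gaze weights.

    features: list of region feature vectors (one per image region)
    gaze:     list of fixation weights in [0, 1], one per region (assumed input)
    w_fix:    hypothetical scoring vector applied to fixated evidence
    w_non:    hypothetical scoring vector applied to non-fixated evidence
    Returns the attention weights and the attended context vector.
    """
    # Each region's score mixes the two scorers according to its gaze weight.
    scores = [
        g * dot(w_fix, f) + (1.0 - g) * dot(w_non, f)
        for f, g in zip(features, gaze)
    ]
    alpha = softmax(scores)
    dim = len(features[0])
    # Context vector: attention-weighted sum of the region features.
    context = [sum(a * f[d] for a, f in zip(alpha, features)) for d in range(dim)]
    return alpha, context


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    gaze = [1.0, 0.0, 0.5]  # fully fixated, not fixated, partly fixated
    alpha, context = split_attention(features, gaze, [2.0, 0.0], [0.0, -1.0])
    print(alpha, context)
```

Because the two scorers have separate parameters, a model of this shape can learn to attend to non-fixated regions when they matter for the caption, rather than treating gaze as a hard mask.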

