CL, CV · Nov 9, 2020

Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

arXiv:2011.04592v1996 citations
AI Analysis

This work addresses the challenge of improving image captioning systems by incorporating human cognitive processes, representing an incremental advancement over existing state-of-the-art methods.

The paper models sequential cross-modal alignment in image description generation using human gaze patterns. The resulting descriptions are better aligned with those of human speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled $\textit{sequentially}$. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.

Code Implementations: 1 repo
