CL, CV · Aug 31, 2018

Learning to Describe Differences Between Pairs of Similar Images

arXiv:1808.10584v11132 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for coherent multi-sentence generation in vision-language tasks, though its contribution is incremental: it builds on existing methods and adds a new dataset and model.

The paper tackles the problem of automatically generating text to describe differences between pairs of similar images, using a crowd-sourced dataset from video-surveillance footage, and proposes a model that outperforms attention-based models in both single- and multi-sentence generation.

In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision, and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence and multi-sentence generation, the proposed model outperforms models that use attention alone.
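The abstract does not say how the clusters of differing pixels are computed, so the following is only a minimal sketch of one plausible first pass, not the authors' method. It assumes grayscale images given as nested lists, thresholds the absolute per-pixel difference, and groups the differing pixels into 4-connected components. The function names `diff_mask` and `pixel_clusters` and the threshold value are hypothetical choices made for illustration.

```python
from collections import deque


def diff_mask(img_a, img_b, threshold=30):
    """Mark pixels whose grayscale intensity differs by more than `threshold`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def pixel_clusters(mask):
    """Group differing pixels into 4-connected clusters of (row, col) pairs."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited differing pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


# Toy 4x5 frames: one 3-pixel change near the top left, one isolated change
# in the bottom-right corner.
before = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
after = [
    [10, 200, 200, 10, 10],
    [10, 200, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 180],
]
clusters = pixel_clusters(diff_mask(before, after))
print(len(clusters), [len(c) for c in clusters])
```

In this toy case each cluster stands in for one changed object. A model along the lines the abstract describes would then align such clusters with output sentences; that alignment step is not sketched here.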

Code Implementations: 1 repo