CV, CL, Feb 3, 2021

L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

arXiv:2102.01860v1, 804 citations
Originality: Incremental advance
AI Analysis

This work offers an incremental improvement for researchers working on visual difference captioning: it introduces a method that leverages semantic understanding of each individual image.

This paper addresses the task of generating descriptions of visual differences between image pairs. The authors propose the Learning-to-Compare (L2C) model, which semantically understands and compares two images while also learning to describe each individually. This approach outperforms the baseline on the Birds-to-Words dataset in both automatic and human evaluations.

Recent advances in language and vision have pushed research forward from captioning a single image to describing visual differences between image pairs. Given two images, I_1 and I_2, the task is to generate a description W_{1,2} comparing them. Existing methods directly model the { I_1, I_2 } -> W_{1,2} mapping without semantic understanding of the individual images. In this paper, we introduce a Learning-to-Compare (L2C) model, which learns to understand the semantic structures of the two images and compare them while also learning to describe each one. We demonstrate that L2C benefits from comparing explicit semantic representations and from single-image captions, and that it generalizes better to new test image pairs. It outperforms the baseline on both automatic and human evaluation on the Birds-to-Words dataset.
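The idea of comparing two images "while learning to describe each one" can be read as a multi-task objective. The abstract does not state the actual loss, so the following is only a minimal sketch under assumptions: the names `sequence_nll`, `joint_loss`, and `caption_weight` are hypothetical, and the weighted-sum form is an illustrative choice, not the authors' formulation.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one text, given the probability the
    model assigned to each of its reference tokens."""
    return -sum(math.log(p) for p in token_probs)


def joint_loss(
    compare_probs: Sequence[float],
    caption1_probs: Sequence[float],
    caption2_probs: Sequence[float],
    caption_weight: float = 1.0,  # hypothetical trade-off hyperparameter
) -> float:
    """Assumed multi-task objective: the loss for the comparative
    description W_{1,2} plus a weighted loss for the single-image
    captions of I_1 and I_2."""
    compare_term = sequence_nll(compare_probs)
    caption_term = sequence_nll(caption1_probs) + sequence_nll(caption2_probs)
    return compare_term + caption_weight * caption_term


if __name__ == "__main__":
    # Toy per-token probabilities, made up for illustration only.
    loss = joint_loss([0.9, 0.8], [0.7, 0.6], [0.5, 0.9], caption_weight=0.5)
    print(f"joint loss: {loss:.4f}")
```

Setting `caption_weight` to 0 reduces this sketch to the direct { I_1, I_2 } -> W_{1,2} objective that the abstract attributes to existing methods.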
