CV, AI, CL | Apr 23, 2018

Object Counts! Bringing Explicit Detections Back into Image Captioning

arXiv:1805.00314v1 | 110 citations
Originality: Synthesis-oriented
AI Analysis

The paper offers interpretability insights for researchers in computer vision and NLP, though its contribution is incremental because it builds on existing methods.

The paper asks why end-to-end image captioning systems work well, and answers by reintroducing explicit object detections as an interpretable image representation. Its analysis indicates that these systems rely on matching image representations to generate captions, and that the frequency, size, and position of objects are complementary cues in forming an effective representation.

The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.
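To make the idea of encoding object frequency, size and position more concrete, here is a minimal Python sketch of one way such a detection-based image representation could be assembled. It is not the authors' implementation: the `Detection` type, the `encode_detections` function, the choice of the largest instance per category, and the normalisation by image dimensions are all assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Detection(NamedTuple):
    """One detected object (hypothetical type, not from the paper)."""
    category: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def encode_detections(detections, categories, image_w, image_h):
    """Build a flat vector with four values per category:
    [count, normalised area, centre x, centre y].

    Area and centre come from the largest instance of each category;
    this is an illustrative assumption, not the paper's stated method.
    """
    counts = Counter(d.category for d in detections)

    largest = {}  # category -> (area, cx, cy) of its largest instance
    for d in detections:
        x0, y0, x1, y1 = d.box
        area = ((x1 - x0) * (y1 - y0)) / (image_w * image_h)
        cx = (x0 + x1) / 2 / image_w
        cy = (y0 + y1) / 2 / image_h
        if d.category not in largest or area > largest[d.category][0]:
            largest[d.category] = (area, cx, cy)

    features = []
    for c in categories:
        # Categories with no detection contribute zeros.
        area, cx, cy = largest.get(c, (0.0, 0.0, 0.0))
        features.extend([float(counts[c]), area, cx, cy])
    return features
```

Dropping the count, area, or centre entries from the vector gives a simple way to test how much each cue contributes, in the spirit of the analysis described in the abstract.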

