CV · ML · May 10, 2021

Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

arXiv:2105.04143v2 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses image paragraph captioning for applications like accessibility and content generation, but it is incremental as it builds on existing topic integration methods.

The paper tackles the problem of generating semantically coherent paragraph captions for images by integrating hierarchical semantic topics into the language model, resulting in models that are competitive with state-of-the-art approaches on standard metrics and can produce diverse and coherent captions.

Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between the image and text at multiple levels of abstraction and learn the semantic topics from images, we design a variational inference network to build the mapping from image features to textual captions. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory (LSTM) and Transformer, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches in terms of standard evaluation metrics, can be used to both distill interpretable multi-layer semantic topics and generate diverse and coherent captions. We release our code at https://github.com/DandanGuo1993/VTCM-based-image-paragraph-caption.git
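
The coupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the linked repository is the reference for that). It assumes a Weibull reparameterization for the nonnegative topic weights, which the abstract does not specify, and every name and size below (`build_layers`, `infer_topics`, `language_model_input`, the layer widths) is hypothetical. The sketch only shows the data flow: image features pass through a stacked inference network, each layer emits topic weights at a coarser level, and the weights from all layers are concatenated with the visual features to form the conditioning input of a language model.

```python
import math
import random


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def linear(x, weights, bias):
    # Dense layer: one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(n_in, n_out, rng):
    # Small random weights, zero bias (illustrative initialisation only).
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_layers(feat_dim, topic_dims, rng):
    # One (hidden, shape, scale) triple of dense maps per topic layer.
    layers, n_in = [], feat_dim
    for k in topic_dims:
        layers.append((init_linear(n_in, k, rng),
                       init_linear(k, k, rng),
                       init_linear(k, k, rng)))
        n_in = k
    return layers


def sample_weibull(shape, scale, rng):
    # Reparameterized draw (assumed choice): scale * (-ln(1 - u))^(1/shape).
    u = rng.random()
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def infer_topics(image_feat, layers, rng):
    # Bottom-up pass: each layer yields nonnegative topic weights.
    thetas, h = [], image_feat
    for (w_h, b_h), (w_k, b_k), (w_l, b_l) in layers:
        h = [softplus(z) for z in linear(h, w_h, b_h)]
        shape = [softplus(z) + 1e-3 for z in linear(h, w_k, b_k)]
        scale = [softplus(z) + 1e-3 for z in linear(h, w_l, b_l)]
        thetas.append([sample_weibull(k, l, rng) for k, l in zip(shape, scale)])
    return thetas


def language_model_input(image_feat, thetas):
    # Concatenate visual features with topic weights from every layer.
    return list(image_feat) + [t for theta in thetas for t in theta]
```

With, say, an 8-dimensional feature vector and `topic_dims=[6, 4, 2]`, `infer_topics` returns three topic-weight vectors, and `language_model_input` yields a single conditioning vector that an LSTM or Transformer decoder could consume at each step. Training the inference network jointly with the decoder, as the abstract describes, is outside the scope of this sketch.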

Code Implementations: 1 repo
