CV · May 18, 2018

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text

Alexander Mathews, Lexing Xie, Xuming He

arXiv:1805.07030v1 · 21.3 · 119 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses stylized image captioning, a vision-and-language task, with a method that learns style from unaligned text data. The advance is incremental: it improves on existing approaches, which either require styled captions aligned to images or generate captions with low relevance.

The paper tackles the problem of generating image captions that are both visually grounded and appropriately styled, without requiring aligned training data. In both automatic and manual evaluations, the generated captions preserve image semantics and are style shifted.

Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. Existing approaches either require styled training captions aligned to images or generate captions with low relevance. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images. The core idea of this model, called SemStyle, is to separate semantics and style. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics. In addition, we develop a unified language model that decodes sentences with diverse word choices and syntax for different styles. Evaluations, both automatic and manual, show captions from SemStyle preserve image semantics, are descriptive, and are style shifted. More broadly, this work provides possibilities to learn richer image descriptions from the plethora of linguistic data available on the web.
