CV AI CL LG

Aug 27, 2016

Learning to generalize to new compositions in image understanding

Yuval Atzmon, Jonathan Berant, Vahid Kazemi, Amir Globerson, Gal Chechik

arXiv:1608.07639v1

18.171 citations

Originality: Incremental advance

AI Analysis

The paper addresses the failure of image captioning systems to handle novel compositions, and advocates compositional models and benchmarks to improve generalization.

The paper tackles the poor generalization of image captioning models to unseen scene compositions. It proposes short structured representations of images, which make it possible to separate and evaluate two types of generalization: to new images with similar scenes, and to new combinations of known entities. On the MS-COCO dataset, the structured model achieved ~7 times the accuracy of a state-of-the-art LSTM-based method (Show, Attend and Tell) in predicting structured representations for new combinations.

Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model generalizes to new compositions substantially better than the LSTM, with ~7 times the accuracy in predicting structured representations. By providing a concrete method to quantify generalization for unseen combinations, we argue that structured representations and compositional splits are a useful benchmark for image captioning, and advocate compositional models that capture linguistic and visual structure.
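The idea of a compositional split mentioned in the abstract can be illustrated with a small sketch. It assumes each image is annotated with a (subject, relation, object) triplet, and it uses a simple greedy rule to hold out whole triplets for testing while keeping every word visible in training in the same slot. The function name `compositional_split`, the triplet format, the greedy rule, and the `test_fraction` parameter are illustrative assumptions, not the procedure used in the paper.

```python
from collections import Counter, defaultdict


def compositional_split(samples, test_fraction=0.3):
    """Split (image_id, triplet) pairs so that no test triplet occurs in training.

    Hypothetical helper: a triplet is an assumed (subject, relation, object)
    tuple. A triplet is moved to the test set only if each of its words still
    appears, in the same slot, in at least one triplet left in training.
    """
    by_triplet = defaultdict(list)
    for image_id, triplet in samples:
        by_triplet[triplet].append(image_id)

    # For each (slot, word) pair, count the distinct training triplets using it.
    support = Counter()
    for triplet in by_triplet:
        for slot_word in enumerate(triplet):
            support[slot_word] += 1

    target = test_fraction * len(samples)
    test_triplets, n_test = set(), 0
    # Visit rare triplets first; ties are broken alphabetically for determinism.
    for triplet in sorted(by_triplet, key=lambda t: (len(by_triplet[t]), t)):
        if n_test >= target:
            break
        parts = list(enumerate(triplet))
        # Hold out the triplet only if every part keeps training support.
        if all(support[p] > 1 for p in parts):
            test_triplets.add(triplet)
            n_test += len(by_triplet[triplet])
            for p in parts:
                support[p] -= 1

    train = [(i, t) for i, t in samples if t not in test_triplets]
    test = [(i, t) for i, t in samples if t in test_triplets]
    return train, test


if __name__ == "__main__":
    # Toy annotations, invented for illustration only.
    toy = [
        (0, ("man", "ride", "horse")),
        (1, ("man", "ride", "bike")),
        (2, ("girl", "ride", "horse")),
        (3, ("girl", "hold", "bike")),
        (4, ("man", "hold", "bike")),
        (5, ("man", "ride", "horse")),
    ]
    train, test = compositional_split(toy)
    print("train:", train)
    print("test:", test)
```

Under such a split, a model that scores well on the test set must recombine known subjects, relations, and objects rather than recall triplets it has already seen, which is the second type of generalization the abstract describes.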
