CV | Dec 29, 2016

Learning Visual N-Grams from Web Data

arXiv:1612.09161v2 | 151 citations
AI Analysis

This paper addresses scalable image recognition for real-world systems by reducing reliance on manual annotation, though its contribution is incremental in that it applies language-modeling methods to vision.

The paper tackles the recognition of tens of thousands of visual concepts by training on web images and their associated user comments. The resulting visual n-gram models predict phrases relevant to an image's content and are evaluated on phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.

Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
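To make the idea of an n-gram-inspired loss concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes each dictionary n-gram has an embedding vector, scores it by a dot product with the image features, and takes the negative log-likelihood of the n-grams observed in a comment under a softmax over the dictionary. The names `extract_ngrams`, `log_softmax`, and `ngram_loss`, and the toy values below, are hypothetical.

```python
import math


def extract_ngrams(tokens, max_n):
    """Return every contiguous n-gram (as a tuple) with 1 <= n <= max_n."""
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    return ngrams


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def ngram_loss(image_features, ngram_embeddings, comment_tokens, max_n=2):
    """Negative log-likelihood of the comment's in-dictionary n-grams.

    Assumption: the score of an n-gram is the dot product between the
    image feature vector and that n-gram's embedding. N-grams that are
    not in the dictionary are simply skipped (no smoothing is applied).
    """
    dictionary = list(ngram_embeddings)
    scores = [
        sum(f * e for f, e in zip(image_features, ngram_embeddings[gram]))
        for gram in dictionary
    ]
    log_probs = dict(zip(dictionary, log_softmax(scores)))
    observed = [g for g in extract_ngrams(comment_tokens, max_n) if g in log_probs]
    return -sum(log_probs[g] for g in observed)


# Toy usage: a 2-dimensional "image feature" and a 4-entry n-gram dictionary.
toy_embeddings = {
    ("golden",): [0.5, 0.1],
    ("bridge",): [0.9, -0.2],
    ("golden", "gate"): [1.2, 0.3],
    ("cat",): [-1.0, 0.4],
}
toy_loss = ngram_loss([1.0, 0.0], toy_embeddings, ["golden", "gate", "bridge"])
```

In this sketch the loss falls as the image features align more closely with the embeddings of n-grams that actually occur in the comment. In a real system the image features would come from a convolutional network and the loss would be minimized by gradient descent over both the network and the embeddings.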

