CV, AI, CL, LG | Aug 9, 2016

Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task

arXiv:1608.02717v1 | 5 citations
Originality: Incremental advance
AI Analysis

This work addresses visual question answering, but it is incremental: it builds on existing methods such as nCCA and CNN+LSTM.

The paper tackles the Visual Madlibs task by introducing Mean Box Pooling, a visual representation that pools over CNN features from many overlapping object proposals, combined with nCCA for multimodal embedding, achieving state-of-the-art performance. It also extends CNN+LSTM training to maximize similarity between internal representations and answers, leading to significant improvements over prior work.

We present Mean Box Pooling, a novel visual representation that pools over CNN representations of a large number of highly overlapping object proposals. We show that such a representation, together with nCCA, a successful multimodal embedding technique, achieves state-of-the-art performance on the Visual Madlibs task. Moreover, inspired by nCCA's objective function, we extend the classical CNN+LSTM approach to train the network by directly maximizing the similarity between the internal representation of the deep learning architecture and the candidate answers. Again, such an approach achieves a significant improvement over prior work that also uses a CNN+LSTM approach on Visual Madlibs.
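
As an illustration of the pooling and answer-selection ideas in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that per-proposal CNN features have already been extracted, and that the image and candidate-answer embeddings already live in a shared space. In the paper that space would come from nCCA or the trained CNN+LSTM; here it is simply taken as given. The function names `mean_box_pooling` and `pick_answer` are hypothetical, and cosine similarity is used as an assumed similarity measure.

```python
import numpy as np


def mean_box_pooling(box_features):
    """Average per-proposal CNN features into one image descriptor.

    box_features: array of shape (num_boxes, feature_dim), one row per
    object proposal. The feature extractor itself is assumed and not shown.
    """
    box_features = np.asarray(box_features, dtype=np.float64)
    if box_features.ndim != 2 or box_features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_boxes, feature_dim) array")
    return box_features.mean(axis=0)


def pick_answer(image_embedding, answer_embeddings):
    """Rank candidate answers by cosine similarity to the image embedding.

    Both inputs are assumed to be in the same embedding space already.
    Returns the index of the best candidate and all similarity scores.
    """
    v = np.asarray(image_embedding, dtype=np.float64)
    a = np.asarray(answer_embeddings, dtype=np.float64)
    # Small constant guards against division by zero for all-zero vectors.
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(v) + 1e-12
    sims = (a @ v) / denom
    return int(np.argmax(sims)), sims


if __name__ == "__main__":
    # Two toy proposals with 2-dimensional features (illustrative values only).
    feats = np.array([[1.0, 0.0], [3.0, 2.0]])
    pooled = mean_box_pooling(feats)
    # Three toy candidate answers in the same (assumed) embedding space.
    candidates = np.array([[0.0, 1.0], [2.0, 1.0], [-2.0, -1.0]])
    best, scores = pick_answer(pooled, candidates)
    print(pooled, best, scores)
```

Averaging keeps the descriptor size fixed regardless of how many proposals an image yields, which is what lets a large, overlapping set of boxes feed a single embedding step.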

