Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
This work addresses visual question answering, but its contribution is incremental: it builds on existing methods such as nCCA and CNN+LSTM.
The paper tackles the Visual Madlibs task by introducing Mean Box Pooling, a visual representation that pools over CNN features from many overlapping object proposals, combined with nCCA for multimodal embedding, achieving state-of-the-art performance. It also extends CNN+LSTM training to maximize similarity between internal representations and answers, leading to significant improvements over prior work.
We present Mean Box Pooling, a novel visual representation that pools over CNN representations of a large number of highly overlapping object proposals. We show that such a representation, together with nCCA, a successful multimodal embedding technique, achieves state-of-the-art performance on the Visual Madlibs task. Moreover, inspired by nCCA's objective function, we extend the classical CNN+LSTM approach to train the network by directly maximizing the similarity between the internal representation of the deep learning architecture and candidate answers. Again, such an approach achieves a significant improvement over prior work that also uses a CNN+LSTM approach on Visual Madlibs.
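The two ideas in the abstract, pooling per-proposal CNN features into one image vector and choosing the candidate answer most similar to it, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each object proposal has already been encoded as a fixed-length feature vector and that the candidate answers have already been mapped into the same space (the paper uses nCCA for that step, which is omitted here). The function names `mean_box_pool`, `cosine`, and `pick_answer` are hypothetical, and cosine similarity is an assumed stand-in for the similarity measure.

```python
import math


def mean_box_pool(box_features):
    """Average per-box feature vectors into a single image vector.

    box_features: list of equal-length lists of floats, one per object
    proposal (assumed to come from a CNN; not computed here).
    """
    if not box_features:
        raise ValueError("need at least one box feature vector")
    dim = len(box_features[0])
    if any(len(f) != dim for f in box_features):
        raise ValueError("all box feature vectors must have the same length")
    n = len(box_features)
    return [sum(f[i] for f in box_features) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_answer(image_vec, answer_vecs):
    """Return the index of the candidate answer most similar to the image."""
    scores = [cosine(image_vec, a) for a in answer_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: four 2-D "box features" and three candidate answer embeddings.
boxes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
answers = [[0.0, 1.0], [2.0, 1.0], [-1.0, 0.0]]

pooled = mean_box_pool(boxes)
best = pick_answer(pooled, answers)
print(pooled, best)
```

In this toy setup the pooled vector points in the same direction as the second candidate, so that candidate is selected. In a multiple-choice setting such as Visual Madlibs, the same argmax over similarities would be applied to the candidate answers supplied with each question.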