Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention
This work addresses the challenge of interpreting semantic and symbolic nuances in visual ads, with applications such as advertising analysis. The contribution is incremental, however, since it builds on existing joint-embedding methods and adds attention refinements.
The paper tackles the problem of understanding visual ads by ranking human-generated statements describing ad purposes, using a multihop co-attention mechanism to align image proposals with symbols for better embeddings, and shows that the model outperforms baselines on a benchmark dataset.
We tackle the problem of understanding visual ads where, given an ad image, our goal is to rank appropriate human-generated statements describing the purpose of the ad. This problem is generally addressed by jointly embedding images and candidate statements to establish correspondence. Decoding a visual ad requires inference of both semantic and symbolic nuances referenced in an image, and prior methods may fail to capture such associations, especially with weakly annotated symbols. To create better embeddings, we leverage an attention mechanism to associate image proposals with symbols and thus effectively aggregate information from aligned multimodal representations. We propose a multihop co-attention mechanism that iteratively refines the attention map to ensure accurate attention estimation. Our attention-based embedding model is learned end-to-end, guided by a max-margin loss function. We show that our model outperforms other baselines on the benchmark Ad dataset and also show qualitative results to highlight the advantages of using multihop co-attention.
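To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of an iterative (multihop) co-attention between image proposals and symbols, followed by a max-margin ranking loss. The abstract does not specify the architecture, so everything here is an assumption made for illustration: dot-product attention scores, mean-pooled starting summaries, summing the two attended vectors into one embedding, cosine similarity, the margin value, and the function names `multihop_co_attention` and `max_margin_loss`. It should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multihop_co_attention(proposals, symbols, hops=2):
    """Illustrative co-attention (assumed form, not the paper's exact model).

    proposals: (P, d) image-proposal features; symbols: (S, d) symbol embeddings.
    Returns a fused embedding plus the final attention over each modality.
    """
    # Hop 0: uniform attention, so each summary starts as a mean-pooled vector.
    img_attn = np.full(proposals.shape[0], 1.0 / proposals.shape[0])
    sym_attn = np.full(symbols.shape[0], 1.0 / symbols.shape[0])
    img_ctx = img_attn @ proposals
    sym_ctx = sym_attn @ symbols
    for _ in range(hops):
        # Re-weight proposals using the current symbol summary.
        img_attn = softmax(proposals @ sym_ctx)
        img_ctx = img_attn @ proposals
        # Re-weight symbols using the refreshed image summary.
        sym_attn = softmax(symbols @ img_ctx)
        sym_ctx = sym_attn @ symbols
    # Assumption: fuse the aligned summaries by simple addition.
    return img_ctx + sym_ctx, img_attn, sym_attn

def max_margin_loss(image_emb, pos_stmt, neg_stmts, margin=0.2):
    """Hinge ranking loss: the matching statement should outscore each
    non-matching statement by at least `margin` (cosine similarity assumed)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    pos = cos(image_emb, pos_stmt)
    negs = np.array([cos(image_emb, n) for n in neg_stmts])
    return np.maximum(0.0, margin - pos + negs).sum()
```

Each hop reuses the summary produced by the other modality in the previous step, which is one simple way to read "iteratively refines the attention map". At ranking time, candidate statements would be sorted by their similarity to the fused embedding.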