Grounding Visual Explanations (Extended Abstract)
This work addresses the problem of ensuring image relevance in generated visual explanations. The contribution is incremental: it builds on existing discriminative explanation models by adding grounding mechanisms.
The paper tackles the problem that object parts mentioned in textual explanations are only weakly constrained to appear in the image. It proposes a model that grounds the constituent phrases of an explanation in the image, using a phrase-critic model trained with a relative-attribute ranking loss to improve explanation relevance.
Existing models that generate textual explanations enforce task relevance through a discriminative term in the loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, we propose a new model for generating explanations that uses localized grounding of the constituent phrases in generated explanations to ensure image relevance. Specifically, we introduce a phrase-critic model to refine (re-score and re-rank) generated candidate explanations, and we train it with a relative-attribute inspired ranking loss that uses "flipped" phrases as negative examples. At test time, our phrase-critic model takes an image and a candidate explanation as input and outputs a score indicating how well the candidate explanation is grounded in the image.
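To make the training signal and the test-time re-ranking concrete, the following is a minimal sketch under stated assumptions, not the architecture described in the paper. It assumes phrases have the form "attribute noun", that a "flipped" phrase is obtained by swapping the attribute for a different one, and that the ranking loss is a standard margin (hinge) ranking loss. The names `flip_phrase`, `toy_critic_score`, `margin_ranking_loss`, and `rerank` are hypothetical, and the critic here is a stand-in that simply sums precomputed per-phrase grounding scores rather than a learned model.

```python
import random


def flip_phrase(phrase, attribute_vocab, rng):
    """Build a negative example by swapping the attribute of an
    'attribute noun' phrase for a different attribute (assumed flip rule)."""
    attribute, noun = phrase.split(" ", 1)
    alternatives = [a for a in attribute_vocab if a != attribute]
    return f"{rng.choice(alternatives)} {noun}"


def toy_critic_score(phrases, grounding_scores):
    """Stand-in critic: sum of per-phrase grounding scores for one image.
    A real critic would be a learned model over image and phrase features."""
    return sum(grounding_scores.get(p, 0.0) for p in phrases)


def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    """Hinge ranking loss: zero once the true explanation outscores the
    flipped one by at least `margin`, positive otherwise."""
    return max(0.0, margin - (score_pos - score_neg))


def rerank(candidates, grounding_scores):
    """Order candidate explanations (lists of phrases) by critic score,
    best-grounded first."""
    return sorted(
        candidates,
        key=lambda phrases: toy_critic_score(phrases, grounding_scores),
        reverse=True,
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical grounding scores for one image of a bird.
    grounding = {"red beak": 0.9, "black beak": 0.1, "white belly": 0.8}

    positive = ["red beak", "white belly"]
    negative = [flip_phrase("red beak", ["red", "black"], rng), "white belly"]

    loss = margin_ranking_loss(
        toy_critic_score(positive, grounding),
        toy_critic_score(negative, grounding),
    )
    print("negative example:", negative)
    print("ranking loss:", loss)
    print("best candidate:", rerank([negative, positive], grounding)[0])
```

In this toy setting the loss is nonzero whenever the flipped explanation scores within the margin of the true one, which is the pressure that would push a learned critic to separate grounded from ungrounded phrases.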