Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention
This work addresses visual question answering by letting a model learn flexible groundings from question-answer pairs alone; the contribution is incremental, since it builds on prior neurosymbolic approaches.
The paper jointly learns word denotations and object groundings for visual question answering under a truth-conditional semantics, and reports state-of-the-art performance on the CLEVR dataset.
We present a model that jointly learns the denotations of words together with their groundings using a truth-conditional semantics. Our model builds on the neurosymbolic approach of Mao et al. (2019), learning to ground objects in the CLEVR dataset (Johnson et al., 2017) using a novel parallel attention mechanism. The model achieves state-of-the-art performance on visual question answering, learning to detect and ground objects with question performance as the only training signal. We also show that the model is able to learn flexible non-canonical groundings just by adjusting answers to questions in the training set.
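To make the idea of a truth-conditional denotation concrete, the toy sketch below treats each word as a function that assigns every object in a scene a soft truth value, and treats grounding as picking the object a phrase points to most strongly. It is an illustration only and does not implement the parallel attention mechanism or any training. The scene, the attribute scores, and the helper names (`denotation`, `conjoin`, `exists`, `count`, `ground`) are all hypothetical and hand-written, whereas a learned model would produce such scores from images.

```python
# Hypothetical toy scene: each object maps attribute words to a soft truth value in [0, 1].
# These numbers are made up for illustration; they are not taken from the paper.
scene = [
    {"red": 0.9, "blue": 0.1, "cube": 0.8, "sphere": 0.2},
    {"red": 0.1, "blue": 0.9, "cube": 0.1, "sphere": 0.9},
    {"red": 0.2, "blue": 0.7, "cube": 0.9, "sphere": 0.1},
]

def denotation(word, scene):
    """Soft denotation of a word: one truth value per object in the scene."""
    return [obj[word] for obj in scene]

def conjoin(a, b):
    """Soft conjunction of two denotations (elementwise product)."""
    return [x * y for x, y in zip(a, b)]

def exists(d):
    """Soft truth value of 'some object satisfies d' (maximum over objects)."""
    return max(d)

def count(d, threshold=0.5):
    """Number of objects whose truth value exceeds the threshold."""
    return sum(1 for x in d if x > threshold)

def ground(d):
    """Index of the object that the denotation points to most strongly."""
    return max(range(len(d)), key=lambda i: d[i])

# "Is there a red cube?" and "Which object is the red cube?"
red_cube = conjoin(denotation("red", scene), denotation("cube", scene))
print(exists(red_cube) > 0.5, ground(red_cube))

# "How many blue objects are there?"
print(count(denotation("blue", scene)))
```

In this sketch the answer to a question is determined entirely by the per-object truth values, which is the sense in which question-answer supervision could, in principle, shape both what a word denotes and which objects it is grounded to.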