Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
This addresses the challenge of grounded language understanding for AI systems by providing a visual resolution method, though it is incremental as it builds on existing vision models.
The paper tackles the problem of disambiguating ambiguous sentences using visual scenes, by introducing a new multimodal corpus with videos for different interpretations and extending a vision model to recognize these interpretations, achieving unified disambiguation across various ambiguity types.
Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing to disambiguate sentences in a unified fashion across the different ambiguity types.