Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
This work addresses the need for robust natural language understanding by reducing reliance on textual biases, though its contribution is incremental because it builds on existing multimodal techniques.
The paper tackles the problem of Natural Language Inference (NLI) by proposing a zero-shot method that grounds language in visual contexts using text-to-image models, achieving high accuracy without task-specific fine-tuning and demonstrating robustness against textual biases.
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging the visual modality as a meaning representation provides a promising direction for robust natural language understanding.
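The cosine-similarity variant of the inference step can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the generated premise image and the textual hypothesis have already been embedded into a shared vector space (for example by a CLIP-style encoder, which is an assumption here), and it uses small hand-written vectors in place of real embeddings. The function names, the two thresholds, and the mapping from similarity score to the three NLI labels are all hypothetical choices made for illustration; the abstract does not specify how scores are turned into labels.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def predict_label(image_embedding, hypothesis_embedding,
                  entail_threshold=0.8, contradict_threshold=0.3):
    """Map a similarity score to an NLI label.

    Both thresholds are hypothetical values chosen for this sketch,
    not values reported in the source.
    """
    score = cosine_similarity(image_embedding, hypothesis_embedding)
    if score >= entail_threshold:
        return "entailment"
    if score <= contradict_threshold:
        return "contradiction"
    return "neutral"


# Toy stand-ins for real embeddings (assumed, not produced by any model).
premise_image = [1.0, 0.0, 0.0]        # embedding of the image generated from the premise
close_hypothesis = [0.9, 0.1, 0.0]     # hypothesis pointing almost the same way
unrelated_hypothesis = [1.0, 1.0, 0.0] # hypothesis partially aligned
opposed_hypothesis = [0.0, 1.0, 0.0]   # hypothesis orthogonal to the image

labels = [
    predict_label(premise_image, close_hypothesis),
    predict_label(premise_image, unrelated_hypothesis),
    predict_label(premise_image, opposed_hypothesis),
]
```

In a real pipeline the toy vectors would be replaced by the output of an image encoder applied to the generated image and a text encoder applied to the hypothesis. No training is involved at any point, which is what makes the setup zero-shot; the only free parameters in this sketch are the two assumed thresholds.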