CV · AI · CL · LG · Feb 11, 2023

Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis

arXiv:2302.05608v1 · 1 citation · h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses robustness in multimodal AI for applications such as visual reasoning. The contribution appears incremental because it builds on existing vision-language models.

The paper tackles the failure of deep multimodal models to capture semantic information and implicit dependencies in noisy settings. It proposes an end-to-end vision and language model that incorporates explicit knowledge graphs, plus an interactive out-of-distribution (OOD) layer that filters noise from the external knowledge base. The reported result is models that perform comparably to state-of-the-art results on tasks such as visual question answering and image-text retrieval, with significantly fewer samples and less training time.

Often, deep network models are purely inductive during training and while performing inference on unseen data. Thus, when such models are used for prediction, it is well known that they often fail to capture the semantic information and implicit dependencies that exist among objects (or concepts) at the population level. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation-friendly manner, especially in large-scale and noisy settings. In this work, we propose an end-to-end vision and language model incorporating explicit knowledge graphs. We also introduce an interactive out-of-distribution (OOD) layer using an implicit network operator. The layer is used to filter noise introduced by the external knowledge base. In practice, we apply our model to several vision and language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval, on different datasets. Our experiments show that it is possible to design models that perform similarly to state-of-the-art results but with significantly fewer samples and less training time.
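To make the idea of differentiable noise filtering concrete, here is a minimal sketch of a soft gate over retrieved knowledge-graph nodes. It is not the paper's OOD layer and does not reproduce its implicit network operator. It assumes a much simpler stand-in: each node is weighted by a sigmoid of its cosine similarity to the input embedding. Because that weight is a smooth function rather than a hard cutoff, gradients can pass through it. The function name `soft_outlier_filter`, the `threshold` and `temperature` parameters, and the toy embeddings are all hypothetical.

```python
import numpy as np

def soft_outlier_filter(query, nodes, threshold=0.0, temperature=0.1):
    """Down-weight knowledge nodes that look unrelated to the query.

    query: (d,) embedding of the image/text input (hypothetical).
    nodes: (n, d) embeddings of retrieved knowledge-graph nodes (hypothetical).
    Returns (gates, pooled): per-node weights in (0, 1) and the gated mean.
    """
    # Normalize so the dot product is a cosine similarity.
    q = query / np.linalg.norm(query)
    k = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    sim = k @ q  # shape (n,)

    # Smooth gate: close to 1 for similar nodes, close to 0 for outliers.
    gates = 1.0 / (1.0 + np.exp(-(sim - threshold) / temperature))

    # Gated average of the original node embeddings.
    pooled = (gates[:, None] * nodes).sum(axis=0) / (gates.sum() + 1e-8)
    return gates, pooled

# Toy example: two nodes aligned with the query, one pointing the opposite way.
query = np.array([1.0, 0.0])
nodes = np.array([[0.9, 0.1], [1.0, 0.2], [-1.0, 0.0]])
gates, pooled = soft_outlier_filter(query, nodes)
```

In the toy example the third node receives a gate near zero, so the pooled vector is dominated by the two aligned nodes. In a trained model the threshold and temperature would typically be learned, and the similarity would come from the network's own representations.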

Code Implementations: 1 repo
