CV AI CL

Feb 23, 2023

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, Ming-Wei Chang

DeepMind, Georgia Tech

arXiv:2302.11713v542.1231 citations, h-index: 38, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the limitation of current multi-modal models in handling complex, information-seeking visual questions, an incremental advance for AI applications that require detailed visual knowledge.

The study asks whether pre-trained vision and language models can answer knowledge-intensive visual information-seeking questions. It finds that state-of-the-art models such as PaLI-X and BLIP2 struggle, but that fine-tuning on the new InfoSeek dataset improves performance by eliciting fine-grained knowledge.

Pre-trained vision and language models have demonstrated state-of-the-art capabilities over existing tasks involving images and texts, including visual question answering. However, it remains unclear whether these models possess the capability to answer questions that are not only querying visual content but knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training. Furthermore, we show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents, showing a significant space for improvement.
