"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
This work addresses the problem of assessing knowledge reliability in vision-language models for Chinese-language applications. Its contribution is incremental, however, since it adapts existing factuality-evaluation concepts to a new language.
The authors address the lack of factual-accuracy evaluation for large vision language models (LVLMs) by introducing ChineseSimpleVQA, the first Chinese factuality-based visual question-answering benchmark, and report critical performance gaps among the 34 models evaluated across 8 topics.
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully assess these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and easy evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field. Our evaluation-friendly code and data have already been open-sourced.
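The decoupling into "seeing the world" and "discovering knowledge" can be illustrated with a minimal scoring sketch. It is not the benchmark's released code: the `Item` fields, the `decoupled_scores` helper, and the normalized exact-match grader are all assumptions made for illustration, and the actual grading procedure may differ. The sketch reports recognition accuracy, final-answer accuracy, and final-answer accuracy restricted to items where the object was recognized correctly.

```python
import unicodedata
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    # Hypothetical field names; not taken from the released dataset.
    recognition_gold: str  # gold answer for the object-recognition hop
    final_gold: str        # gold answer for the knowledge question
    recognition_pred: str  # model's answer to the recognition hop
    final_pred: str        # model's answer to the knowledge question


def normalize(text: str) -> str:
    """Fold full-width forms and case, then drop all whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(folded.split())


def is_correct(pred: str, gold: str) -> bool:
    # Assumption: normalized exact match stands in for the real grader.
    return normalize(pred) == normalize(gold)


def decoupled_scores(items: List[Item]) -> Dict[str, float]:
    """Score the two stages separately and jointly."""
    n = len(items)
    seen = [is_correct(it.recognition_pred, it.recognition_gold) for it in items]
    known = [is_correct(it.final_pred, it.final_gold) for it in items]
    n_seen = sum(seen)
    n_both = sum(s and k for s, k in zip(seen, known))
    return {
        # "Seeing the world": was the object identified?
        "recognition_acc": n_seen / n if n else 0.0,
        # End-to-end accuracy on the knowledge question.
        "final_acc": sum(known) / n if n else 0.0,
        # "Discovering knowledge": accuracy once recognition succeeded.
        "knowledge_given_recognition": n_both / n_seen if n_seen else 0.0,
    }
```

Comparing `final_acc` with `knowledge_given_recognition` gives a rough sense of whether errors come from failing to recognize the object or from missing knowledge about it, which is the kind of analysis the decoupling is meant to support.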