CL · Feb 17, 2024

Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui

Peking U

arXiv:2402.11281v3 · 17.935 citations · h-index: 38 · Has Code · ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of systematic evaluation of deep semantic understanding in LMMs, which matters for applications such as social media analysis. It is incremental, however, because it contributes a benchmark rather than new methods.

The authors evaluate the ability of Large Multimodal Models (LMMs) to understand the deep semantics of images by introducing DEEPEVAL, a benchmark with human-annotated data and three subtasks. They find that GPT-4V is 30% behind humans in deep semantic comprehension, even though it achieves human-comparable performance in image description.

Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess the capacity of Large Multimodal Models (LMMs) for visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating that fundamental challenges remain in developing LMMs.
