CV AI CL LG | Mar 29, 2024

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa

arXiv:2403.20331v4 | 15.814 citations | h-index: 23 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses the need for more reliable evaluation of LMM trustworthiness, which matters particularly to researchers and developers. The advance is incremental because it builds on existing benchmark methods.

This paper tackles the problem of evaluating robust understanding in Large Multimodal Models (LMMs) by introducing the Unsolvable Problem Detection (UPD) task, which assesses their ability to withhold answers when faced with unsolvable multiple-choice questions, and finds that most LMMs struggle significantly on the new MM-UPD benchmark.

This paper introduces a novel task, Unsolvable Problem Detection (UPD), to evaluate the robust understanding capability of Large Multimodal Models (LMMs). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable MCQA problems, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases such as answer-lacking choices, incompatible choices, and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, despite demonstrating adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improve performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs. The code is available at https://github.com/AtsuMiyai/UPD.
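
To make the three UPD settings concrete, the following is a minimal Python sketch of how each unsolvable variant could be derived from an ordinary MCQA item, based only on the descriptions in the abstract. It is not the authors' code: the `MCQAItem` type, the helper names, and the convention that `None` stands for "withhold the answer" are all hypothetical, and the actual MM-UPD construction and scoring may differ (see the linked repository).

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MCQAItem:
    """Hypothetical container for one multiple-choice item (not from the paper)."""
    image_id: str
    question: str
    options: List[str]
    # Index of the correct option; None means no option is correct,
    # so the expected behaviour is to withhold an answer.
    answer_index: Optional[int]


def make_aad(item: MCQAItem) -> MCQAItem:
    """Absent Answer Detection: remove the correct option from the choices."""
    remaining = [o for i, o in enumerate(item.options) if i != item.answer_index]
    return replace(item, options=remaining, answer_index=None)


def make_iasd(item: MCQAItem, unrelated_options: Sequence[str]) -> MCQAItem:
    """Incompatible Answer Set Detection: swap in choices unrelated to the question."""
    return replace(item, options=list(unrelated_options), answer_index=None)


def make_ivqd(item: MCQAItem, unrelated_image_id: str) -> MCQAItem:
    """Incompatible Visual Question Detection: pair the question with a mismatched image."""
    return replace(item, image_id=unrelated_image_id, answer_index=None)


def is_correct(item: MCQAItem, prediction: Optional[int]) -> bool:
    """A prediction is an option index, or None when the model withholds an answer."""
    return prediction == item.answer_index


# Usage: a solvable item and its three unsolvable counterparts.
standard = MCQAItem("img_001", "What animal is shown?", ["cat", "dog", "bird"], 1)
aad = make_aad(standard)
iasd = make_iasd(standard, ["red", "blue", "green"])
ivqd = make_ivqd(standard, "img_999")
```

Under this sketch, a model that picks any option on `aad`, `iasd`, or `ivqd` is scored as wrong, while the same model must still pick index 1 on `standard`. Pairing the solvable and unsolvable versions in this way is one plausible means of separating genuine understanding from option guessing.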
