CV AI CL CY LG · Jan 6, 2025

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy

Stanford

arXiv:2501.03225v228.935 citations · h-index: 19 · Has Code · CVPR

Originality: Incremental advance

AI Analysis

This work gives VLM researchers a scalable and objective evaluation method, though the advance is incremental because it builds on existing VQA datasets.

The paper addresses a core difficulty in evaluating vision language models (VLMs): responses to open-ended questions vary in wording, which makes them hard to score reliably. It introduces AutoConverter, an agentic framework that automatically converts open-ended visual question answering (VQA) questions into multiple-choice format. Using AutoConverter, the authors build VMCBench, a benchmark of 9,018 questions, and show that VLMs achieve similar or lower accuracy on the generated questions than on human-created ones.

The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly multiple-choice question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
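
To illustrate why a multiple-choice format enables objective evaluation, the following is a minimal sketch of letter-based scoring. It is not the paper's code: the names `MultipleChoiceQuestion`, `extract_choice`, and `accuracy` are hypothetical, and the sketch assumes the model has been prompted to begin its reply with an option letter.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MultipleChoiceQuestion:
    """Hypothetical container for one converted question."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def extract_choice(response: str, valid_letters: str = "ABCD") -> Optional[str]:
    """Return the option letter a response starts with, or None.

    Assumes the reply begins with a letter such as "B", "(c)", or "A.".
    A reply starting with a longer word (e.g. "Dog") yields None.
    """
    match = re.match(r"\s*\(?([A-Za-z])\)?(?![A-Za-z])", response)
    if match and match.group(1).upper() in valid_letters:
        return match.group(1).upper()
    return None


def accuracy(questions: List[MultipleChoiceQuestion], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the answer key."""
    correct = sum(
        extract_choice(r) == q.answer for q, r in zip(questions, responses)
    )
    return correct / len(questions)


if __name__ == "__main__":
    opts = {"A": "a dog", "B": "a cat", "C": "a red car", "D": "a tree"}
    questions = [
        MultipleChoiceQuestion("What animal is shown?", opts, "B"),
        MultipleChoiceQuestion("What vehicle is shown?", opts, "C"),
        MultipleChoiceQuestion("What is on the left?", opts, "A"),
    ]
    responses = ["B", "(c) a red car", "Dog"]
    print(accuracy(questions, responses))
```

Scoring reduces to comparing a letter against an answer key, so no judgment about paraphrases or partial matches is needed. By contrast, the free-text reply "Dog" cannot be matched to an option without further interpretation, which is the variability that open-ended evaluation has to handle.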
