BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
This work addresses the need for more targeted and cost-effective evaluation of VLMs, shifting from coarse benchmarks to diagnostic testing, although its contribution to evaluation methodology is incremental.
The paper proposes a methodology for evaluating Visual Language Models (VLMs) that uses procedurally generated synthetic images to systematically test and analyze perception failures, enabling fine-grained and interpretable assessment.
Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.
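To make the idea of a controlled difficulty sweep concrete, the following is a minimal sketch in Python. It is not taken from the BYO-EVAL repository: the names `SceneSpec`, `difficulty_sweep`, `place_objects`, and `accuracy_by_level` are hypothetical, the rendering step is reduced to sampling object positions, and the "model" is a stand-in function. It only illustrates varying one attribute (the object count) while holding the others fixed, then reporting accuracy per difficulty level rather than a single aggregate score.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneSpec:
    """Hypothetical description of one synthetic image."""
    n_objects: int              # the attribute under test
    object_size: float = 0.1    # held constant across the sweep
    background: str = "plain"   # held constant across the sweep
    seed: int = 0               # held constant so layouts stay comparable


def difficulty_sweep(base, counts):
    """Return specs that differ from `base` only in the object count."""
    return [replace(base, n_objects=n) for n in counts]


def place_objects(spec):
    """Stand-in for rendering: sample (x, y) positions from a seeded RNG."""
    rng = random.Random(spec.seed)
    return [(rng.random(), rng.random()) for _ in range(spec.n_objects)]


def accuracy_by_level(specs, predict):
    """Group exact-match correctness by object count."""
    hits = defaultdict(list)
    for spec in specs:
        hits[spec.n_objects].append(predict(spec) == spec.n_objects)
    return {n: sum(v) / len(v) for n, v in sorted(hits.items())}


# Toy predictor (an assumption, not a real VLM): it cannot count past five.
def capped_counter(spec):
    return min(spec.n_objects, 5)


specs = difficulty_sweep(SceneSpec(n_objects=1), range(1, 9))
report = accuracy_by_level(specs, capped_counter)
print(report)
```

With the toy predictor, the per-level report shows where counting breaks down instead of averaging the failure away. Because every spec shares the same seed, each scene in this sketch extends the previous one by a single object, which is one simple way to keep the remaining visual parameters constant.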