Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models
The proposed protocol gives researchers and developers a more efficient way to evaluate large vision-language models, though it is an incremental improvement over existing evaluation methods.
The authors address the high computational cost of comprehensively evaluating large vision-language models by proposing a subset construction method based on farthest point sampling. The resulting subsets maintain a correlation above 0.96 with full evaluations while using only about 1% of the data.
We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation (> 0.96) with full evaluations while using only ~1% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.
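To make the subset construction step concrete, the following is a minimal sketch of greedy farthest point sampling. The abstract does not say how benchmark samples are represented or which distance is used, so the feature matrix, the Euclidean distance, the function name `farthest_point_sampling`, and the `start` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def farthest_point_sampling(features, k, start=0):
    """Greedily pick k row indices of `features` that are far apart.

    Assumptions (not from the paper): each benchmark sample is a row of
    a feature matrix, and distance is Euclidean.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    selected = [start]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(features - features[start], axis=1)

    for _ in range(k - 1):
        # The next pick is the sample farthest from the current subset.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected


# Toy usage: four points on a line, keep three of them.
points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]])
subset = farthest_point_sampling(points, k=3)
```

In this toy example the near-duplicate of the starting point is the one left out, which is the behavior that lets a small subset cover the sample space. Checking the reported correlation would then mean scoring several models on both the subset and the full data and comparing the two score lists, for instance with `np.corrcoef`.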