CV · Jun 27, 2024

Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

arXiv:2406.18849v4 · 4 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for robust evaluation benchmarks for LVLMs. It is incremental in that it builds on existing benchmark concepts and extends them with new features.

The authors tackle the problem of evaluating perception ability in Large Vision-Language Models (LVLMs) by proposing Dysca, a dynamic and scalable benchmark. Dysca uses synthesized images to avoid data leakage and covers multi-stylized images and noisy scenarios, and its results reveal drawbacks in current LVLMs.

Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most of them construct questions by selecting images from existing datasets, which creates a risk of data leakage. In addition, these benchmarks focus only on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose Dysca, a dynamic and scalable benchmark that evaluates LVLMs using synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions, and the corresponding answers. We consider 51 kinds of image styles and evaluate perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking, and Adversarial Attacking) and 3 question types (i.e., Multi-choice, True-or-false, and Free-form). Thanks to the generative paradigm, Dysca is a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 24 advanced open-source LVLMs and 2 closed-source LVLMs are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Robin-WZQ/Dysca.
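To make the rule-based generation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each sample starts from sampled metadata (a subject and a style), that the text-to-image prompt is built from that metadata, and that the question and ground-truth answer are derived from the same metadata, so no human labeling is needed. The pools `STYLES` and `ANIMALS`, the `Sample` class, and both helper functions are hypothetical names chosen for illustration; the image generation step itself (e.g., a Stable Diffusion call) is deliberately left out.

```python
import random
from dataclasses import dataclass

# Hypothetical metadata pools for illustration only; the benchmark itself
# reports 51 image styles and 20 subtasks.
STYLES = ["oil painting", "pixel art", "watercolor", "photorealistic"]
ANIMALS = ["cat", "dog", "horse", "owl", "rabbit"]


@dataclass
class Sample:
    prompt: str    # text that would be sent to a text-to-image model
    question: str  # question posed to the LVLM about the generated image
    options: list  # answer choices (empty for true-or-false questions)
    answer: str    # ground truth, known because it comes from the metadata


def make_multiple_choice(rng, n_options=4):
    """Build a multiple-choice sample whose answer follows from the prompt."""
    subject = rng.choice(ANIMALS)
    style = rng.choice(STYLES)
    prompt = f"{subject}, {style} style"
    # Distractors are the other subjects, so the correct option is unique.
    distractors = [a for a in ANIMALS if a != subject]
    options = rng.sample(distractors, n_options - 1) + [subject]
    rng.shuffle(options)
    return Sample(prompt, "Which animal is shown in the image?", options, subject)


def make_true_false(rng):
    """Build a true-or-false sample by comparing a claim with the metadata."""
    subject = rng.choice(ANIMALS)
    style = rng.choice(STYLES)
    claimed = rng.choice(ANIMALS)
    prompt = f"{subject}, {style} style"
    question = f"The image shows this animal: {claimed}. True or false?"
    answer = "true" if claimed == subject else "false"
    return Sample(prompt, question, [], answer)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is reproducible
    print(make_multiple_choice(rng))
    print(make_true_false(rng))
```

Because the answer is derived from the same metadata as the prompt, adding a new subtask under this sketch amounts to adding a new metadata pool and a question template, which is the kind of scalability the abstract describes.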

Code Implementations: 2 repos
