CAST: Cross-modal Alignment Similarity Test for Vision Language Models
The work addresses the need for better evaluation of VLMs, so that they can be relied on in broader tasks requiring both visual and language inputs. The contribution is incremental, as it builds on existing VLM evaluation methods.
The paper tackles the problem that Visual Question Answering (VQA) tasks fail to fully capture input biases in Vision Language Models (VLMs), or hallucinations that arise from misalignment between modalities. It proposes the Cross-modal Alignment Similarity Test (CAST), which probes VLMs for self-consistency across modalities and focuses on internal consistency rather than objective accuracy.
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes given text-only, image-only, or combined inputs, and then to assess the truthfulness of the similarities it generated. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
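The generate-then-verify procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how consistency is scored, so everything here is an assumption made for illustration: the function names (self_consistency, toy_generate, toy_verify), the representation of scenes as plain strings, and the score itself (the fraction of a model's own similarity statements that it later accepts as true under each input modality). The toy functions stand in for real VLM calls and return hard-coded answers.

```python
from typing import Callable, Dict, List

# The three input conditions named in the abstract.
MODALITIES = ("text", "image", "both")

# Hypothetical interfaces standing in for real VLM calls.
# generate(scene_a, scene_b, modality) -> list of similarity statements
GenerateFn = Callable[[str, str, str], List[str]]
# verify(statement, scene_a, scene_b, modality) -> True if the model judges it true
VerifyFn = Callable[[str, str, str, str], bool]


def self_consistency(
    generate: GenerateFn, verify: VerifyFn, scene_a: str, scene_b: str
) -> Dict[str, Dict[str, float]]:
    """Return scores[gen_modality][verify_modality]: the fraction of statements
    generated under one modality that the same model accepts under another.
    This scoring rule is an assumption, not the paper's stated metric."""
    scores: Dict[str, Dict[str, float]] = {}
    for gen_mod in MODALITIES:
        statements = generate(scene_a, scene_b, gen_mod)
        scores[gen_mod] = {}
        for ver_mod in MODALITIES:
            if not statements:
                # No statements to check; report 0.0 by convention.
                scores[gen_mod][ver_mod] = 0.0
                continue
            accepted = sum(
                1 for s in statements if verify(s, scene_a, scene_b, ver_mod)
            )
            scores[gen_mod][ver_mod] = accepted / len(statements)
    return scores


# Toy stand-ins with hard-coded behavior (not real model output).
def toy_generate(scene_a: str, scene_b: str, modality: str) -> List[str]:
    return {
        "text": ["both contain a dog", "both are outdoors"],
        "image": ["both contain a dog", "both show a red car"],
        "both": ["both contain a dog"],
    }[modality]


def toy_verify(statement: str, scene_a: str, scene_b: str, modality: str) -> bool:
    accepted = {
        "text": {"both contain a dog", "both are outdoors"},
        "image": {"both contain a dog"},
        "both": {"both contain a dog", "both are outdoors"},
    }
    return statement in accepted[modality]


scores = self_consistency(toy_generate, toy_verify, "scene_a", "scene_b")
for gen_mod, row in scores.items():
    print(gen_mod, row)
```

In this toy setup, a score below 1.0 marks a case where the stand-in model rejects a similarity it produced itself, which is the kind of internal inconsistency the test is meant to surface. No ground-truth labels are needed at any point, matching the abstract's framing.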