Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
This work addresses principled benchmark design and evaluation for multi-modal learning, a concern for both researchers and practitioners in AI. Its contribution is incremental: it characterizes existing benchmarks rather than proposing a new method.
The paper addresses the poorly characterized intra- and inter-modality dependencies in multi-modal learning through a large-scale empirical study of 23 visual question-answering benchmarks using multi-modal large language models (MLLMs). It finds that reliance on vision, text, and their interaction varies significantly, that many benchmarks intended to reduce text-only biases have instead amplified image-only dependencies, and that larger models often exploit these dependencies, which masks a lack of multi-modal reasoning.
Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of these dependencies, and the interaction between them, remains poorly characterized in current benchmark evaluations. In this work, we present a large-scale empirical study that quantifies these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs), covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
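
As a rough illustration of what quantifying these dependencies could look like, the sketch below compares accuracy when a model sees only the image, only the question, or both. This is an assumed setup, not the procedure used in the paper: the three evaluation conditions, the chance-level baseline, and the names `DependencyProfile`, `accuracy`, and `profile` are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DependencyProfile:
    """Accuracies for one benchmark under three hypothetical input conditions."""
    image_only: float  # model sees the image, question withheld
    text_only: float   # model sees the question, image withheld
    both: float        # model sees image and question
    chance: float      # accuracy of random guessing

    @property
    def inter_modality_gain(self) -> float:
        # Accuracy that only appears when both modalities are present,
        # measured against the best single-modality (or chance) baseline.
        return self.both - max(self.image_only, self.text_only, self.chance)


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of examples answered correctly."""
    if not correct:
        raise ValueError("need at least one example")
    return sum(correct) / len(correct)


def profile(
    image_only: Sequence[bool],
    text_only: Sequence[bool],
    both: Sequence[bool],
    num_choices: int,
) -> DependencyProfile:
    """Build a profile from per-example correctness flags (assumed multiple-choice)."""
    if not (len(image_only) == len(text_only) == len(both)):
        raise ValueError("all conditions must cover the same examples")
    return DependencyProfile(
        image_only=accuracy(image_only),
        text_only=accuracy(text_only),
        both=accuracy(both),
        chance=1.0 / num_choices,
    )


# Toy data, invented for illustration: four examples, four answer choices each.
example = profile(
    image_only=[True, True, True, False],
    text_only=[True, False, False, False],
    both=[True, True, True, True],
    num_choices=4,
)
```

Under these assumptions, a benchmark whose `image_only` accuracy is close to its `both` accuracy would show a small `inter_modality_gain`, which is one way to read the claim that high scores can mask a lack of multi-modal reasoning.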