VGGSounder: Audio-Visual Evaluations for Foundation Models
This work addresses evaluation challenges faced by researchers and developers of audio-visual foundation models; its contribution is incremental in that it builds on an existing dataset.
The authors address the unreliable evaluation of audio-visual foundation models by identifying limitations in the VGGSound dataset, such as incomplete labeling and misaligned modalities. They introduce VGGSounder, a re-annotated multi-label test set that enables precise modality-specific analyses and reveals model limitations through a new modality confusion metric.
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, using our new modality confusion metric, we reveal model limitations by analysing the performance degradation that occurs when another input modality is added.
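The abstract does not give a formula for the modality confusion metric, so the following is only a minimal sketch of one plausible reading: the fraction of test samples that a model classifies correctly from a single modality but gets wrong once the second modality is added. The function name `modality_confusion` and the per-sample boolean inputs are assumptions made for illustration, not the authors' definition or code.

```python
from typing import Sequence


def modality_confusion(
    correct_unimodal: Sequence[bool],
    correct_multimodal: Sequence[bool],
) -> float:
    """Fraction of samples solved with one modality but missed with both.

    Hypothetical reading of "modality confusion"; the exact definition
    used for VGGSounder is not stated in the abstract.
    """
    if len(correct_unimodal) != len(correct_multimodal):
        raise ValueError("both inputs must cover the same samples")
    if not correct_unimodal:
        return 0.0
    # A sample counts as "confused" when adding a modality turns a
    # correct prediction into an incorrect one.
    confused = sum(
        1 for uni, multi in zip(correct_unimodal, correct_multimodal)
        if uni and not multi
    )
    return confused / len(correct_unimodal)


# Toy example with made-up outcomes for four samples.
audio_only = [True, True, False, True]
audio_visual = [True, False, False, True]
score = modality_confusion(audio_only, audio_visual)
print(score)  # 0.25: one of four samples degrades when video is added
```

Under this reading, a score of zero would mean that adding a modality never hurts a previously correct prediction, and larger values would indicate that the extra input distracts the model.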