MM | AI | SD | AS | Aug 25, 2024

Unveiling Visual Biases in Audio-Visual Localization Benchmarks

arXiv:2409.06709v1 | 1 citations | h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work highlights a critical flaw in audio-visual learning benchmarks that could mislead research progress in the domain.

The paper finds that existing audio-visual source localization benchmarks suffer from visual biases: sounding objects can often be recognized from visual cues alone, which undermines model evaluation. Vision-only models outperform all audio-visual baselines on the VGG-SS and EpicSounding-Object benchmarks, indicating that these benchmarks need refinement.

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where vision-only models outperform all audio-visual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
