AS | CV | MM | SD | Sep 19, 2023

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Meta AI, MIT
arXiv:2309.10787v2 | 25 citations | h-index: 83
Originality: Synthesis-oriented
AI Analysis

AV-SUPERB provides a standardized evaluation framework for researchers in audio-visual learning, though the contribution is incremental because it builds on existing benchmark concepts.

The authors address the problem of evaluating audio-visual representation models by proposing AV-SUPERB, a benchmark covering 7 datasets and 5 tasks. None of the 5 recent self-supervised models they evaluate generalizes to all tasks, which highlights the need for better universal models.

Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.
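The three evaluation tracks named in the abstract (audio-only, video-only, and bimodal fusion) can be illustrated with a small sketch. This is not the benchmark's released code. It assumes that upstream features are frozen, that fusion is plain concatenation, and that the downstream head is a closed-form ridge-regression probe; none of these details are stated in the abstract. The features are synthetic stand-ins, and all function names are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression probe onto one-hot targets (assumed stand-in for a trained head)."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])  # append a bias column
    y = np.eye(num_classes)[labels]                              # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def probe_accuracy(weights, features, labels):
    """Classification accuracy of a fitted probe on held-out features."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == labels))


def evaluate_tracks(train, test, num_classes):
    """Score audio-only, video-only, and fusion features with the same probe.

    `train` and `test` are (audio, video, labels) tuples of frozen features.
    Fusion by concatenation is an assumption made for this sketch.
    """
    a_tr, v_tr, y_tr = train
    a_te, v_te, y_te = test
    tracks = {
        "audio": (a_tr, a_te),
        "video": (v_tr, v_te),
        "fusion": (np.hstack([a_tr, v_tr]), np.hstack([a_te, v_te])),
    }
    scores = {}
    for name, (x_tr, x_te) in tracks.items():
        weights = fit_linear_probe(x_tr, y_tr, num_classes)
        scores[name] = probe_accuracy(weights, x_te, y_te)
    return scores


# Synthetic data: each class has one audio and one video prototype plus Gaussian noise.
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
audio_protos = rng.normal(size=(num_classes, dim))
video_protos = rng.normal(size=(num_classes, dim))


def sample(n):
    """Draw n labelled (audio, video) feature pairs around the class prototypes."""
    labels = rng.integers(0, num_classes, size=n)
    audio = audio_protos[labels] + rng.normal(size=(n, dim))
    video = video_protos[labels] + rng.normal(size=(n, dim))
    return audio, video, labels


scores = evaluate_tracks(sample(400), sample(200), num_classes)
print(scores)
```

Running the same probe over all three tracks keeps the comparison between modalities fair, since only the input representation changes. In a real evaluation, the synthetic arrays would be replaced by features extracted from each pretrained model on each of the benchmark's datasets.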

Code Implementations: 1 repo
