MM, CL, IR, SD, AS | Mar 12, 2025

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

arXiv:2503.09205v4 | 3 citations | h-index: 14 | EUSIPCO
Originality: Incremental advance
AI Analysis

This work addresses data efficiency in audio-video multimodal models. It appears incremental: it builds on existing methods such as DenseAV and adds a novel LLM-based data curation approach.

The paper tackles the challenge of integrating audio and visual data for multimodal foundation models by proposing the Audio-Video Vector Alignment (AVVA) framework, which uses LLM-based curation to select aligned training data. Trained on only 192 hours of curated data, AVVA achieves significant improvements over DenseAV in top-k accuracy for video-to-audio retrieval on AudioCaps, VALOR, and VGGSound.
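For readers unfamiliar with the metric, top-k accuracy for video-to-audio retrieval can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `top_k_accuracy`, the toy similarity matrix, and the convention that the correct audio clip for video i sits at index i are all assumptions made for the example.

```python
def top_k_accuracy(sim, k):
    """Fraction of video queries whose paired audio clip ranks in the top k.

    sim[i][j] is the similarity between video i and audio j. The correct
    match for video i is assumed to be audio i (the diagonal).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank audio candidates for this video from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Hypothetical similarity scores: 3 video queries against 3 audio candidates.
sim = [
    [0.9, 0.1, 0.3],  # video 0: correct audio ranked 1st
    [0.8, 0.5, 0.2],  # video 1: correct audio ranked 2nd
    [0.7, 0.6, 0.1],  # video 2: correct audio ranked 3rd
]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(sim, k):.3f}")
```

Larger k is more forgiving, so top-k accuracy never decreases as k grows; reported gains are usually compared at the same k.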

Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hours of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quantity for data quality, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
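The abstract names contrastive learning on AV pairs in a dual-encoder structure but does not spell out the loss. The sketch below assumes a symmetric InfoNCE objective (as popularised by CLIP), which is one common choice and may differ from what AVVA actually uses. The embeddings are hand-written stand-ins for the outputs of the audio and video encoders, and the names `contrastive_loss`, `cross_entropy_diagonal`, and the temperature of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/video embeddings.

    audio_emb[i] and video_emb[i] are assumed to come from the same clip.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    # logits[i][j]: temperature-scaled cosine similarity of video i and audio j.
    logits = [
        [sum(p * q for p, q in zip(v, a)) / temperature for a in audio]
        for v in video
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-video direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Stand-in embeddings (assumed, not real encoder outputs).
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_video = [[1.0, 0.0], [0.0, 1.0]]   # each video matches its own audio
swapped_video = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

print("aligned loss:", contrastive_loss(audio_batch, aligned_video))
print("swapped loss:", contrastive_loss(audio_batch, swapped_video))
```

The loss is small when each video embedding is closest to its own audio embedding and large when pairs are mismatched, which is why the quality of the training pairs matters for this kind of objective.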

