CV | MM | SD | AS | Apr 2, 2025

Aligned Better, Listen Better for Audio-Visual Large Language Models

arXiv:2504.02061v1 | 11 citations | h-index: 9 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the need for better multimodal video understanding in AI systems, though it appears incremental because it builds on existing AV-LLM frameworks.

The paper targets weak audio understanding and hallucinations in audio-visual large language models. It proposes Dolphin, a fine-grained AV-LLM that aligns the audio and visual modalities concurrently in the temporal and spatial dimensions, and it curates AVU, a dataset of 5.2 million data tuples. The authors report stronger audio-visual understanding and fewer hallucinations.

Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. On the other hand, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and audio-visual large language models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM named Dolphin. Concurrent alignment of the audio and visual modalities in both the temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves remarkable performance in audio-visual understanding but also mitigates potential hallucinations.
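
The abstract names audio-visual interleaved merging but does not say how it is implemented. As an illustration only, the sketch below shows one plausible reading: tokens from the two modalities are grouped by time segment and concatenated segment by segment, so that audio and visual tokens from the same moment sit next to each other in the sequence. The function name `interleave_by_time`, the per-segment grouping, and the toy token labels are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_by_time(
    visual_segments: Sequence[Sequence[T]],
    audio_segments: Sequence[Sequence[T]],
) -> List[T]:
    """Merge per-segment tokens as [v_0..., a_0..., v_1..., a_1..., ...].

    Hypothetical helper: assumes both modalities were already split into
    the same number of time segments. Not the paper's actual procedure.
    """
    if len(visual_segments) != len(audio_segments):
        raise ValueError("both modalities need the same number of time segments")

    merged: List[T] = []
    for visual_tokens, audio_tokens in zip(visual_segments, audio_segments):
        # Keep tokens from the same time segment adjacent in the sequence.
        merged.extend(visual_tokens)
        merged.extend(audio_tokens)
    return merged


# Toy input: two time segments, two visual tokens and one audio token each.
visual = [["v0_a", "v0_b"], ["v1_a", "v1_b"]]
audio = [["a0"], ["a1"]]
merged = interleave_by_time(visual, audio)
print(merged)
```

In a real model the list elements would be embedding vectors rather than strings, and the number of tokens per segment would depend on the encoders, which the abstract does not describe.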

