CV · CL · MM · Jul 21, 2024

Audio-visual training for improved grounding in video-text LLMs

arXiv:2407.15046v1 · 23 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the under-explored role of audio in video understanding for multimodal AI applications. It is an incremental improvement over existing methods.

The paper tackles the problem of video-text LLMs largely ignoring audio signals by proposing an audio-visual training approach, resulting in improved grounding of responses compared to vision-only and other audio-visual baselines.

Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.

