CV AI LG AS

Jul 1, 2024

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

arXiv:2407.01851v222.133 citations

h-index: 24

Has Code

Originality: Incremental advance

AI Analysis

Meerkat addresses the need for precise spatial and temporal grounding in audio-visual AI systems. It is a significant but incremental advance over existing coarse-grained methods.

The authors address fine-grained audio-visual understanding in multi-modal LLMs by introducing Meerkat, which achieves state-of-the-art performance, with a relative improvement of up to 37.12%, on tasks such as audio referred image grounding and audio-visual fact-checking.

Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
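The abstract says only that the modality alignment module is "based on optimal transport" and gives no further detail. As a rough illustration of the general idea, and not the authors' implementation, the sketch below computes an entropic-regularised transport plan (Sinkhorn iterations) between a set of audio token embeddings and a set of image token embeddings. The function names, token counts, embedding size, cosine cost, uniform marginals, and `eps` value are all assumptions chosen for the example.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropic OT plan between uniform marginals (assumed, not from the paper)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over audio tokens
    b = np.full(m, 1.0 / m)   # uniform mass over image tokens
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale to match row marginals
        v = b / (K.T @ u)     # rescale to match column marginals
    return u[:, None] * K * v[None, :]


# Hypothetical toy embeddings: 4 audio tokens and 6 image tokens, 8 dims each.
rng = np.random.default_rng(0)
audio_tokens = rng.normal(size=(4, 8))
image_tokens = rng.normal(size=(6, 8))

cost = cosine_cost(audio_tokens, image_tokens)
plan = sinkhorn_plan(cost)

# A scalar alignment loss: total transport cost under the plan.
ot_loss = float((plan * cost).sum())
```

In a training setup, a quantity like `ot_loss` could serve as an alignment objective that pulls matching audio and image tokens together. Whether Meerkat uses this cost, this regularisation, or uniform marginals is not stated in the abstract.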
