CV | AI | CL | MM | SD | AS | Dec 3, 2024

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

arXiv:2412.02611v10.1735 citations | h-index: 13 | Has Code
AI Analysis: 50

The paper provides a benchmark for evaluating MLLMs' audio-visual understanding. The contribution is incremental: it builds on existing multimodal testing but targets specific weaknesses.

The paper observes that multimodal large language models (MLLMs) often fail at simple audio tasks, such as judging which of two sounds is louder or higher in pitch. It introduces AV-Odyssey Bench, a benchmark of 4,555 problems for assessing audio-visual understanding, and uses it to reveal limitations in current models.

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.
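Because the questions are multiple-choice, scoring can in principle be done by string matching alone, with no human or LLM judge. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names (`extract_choice`, `accuracy`), the option labels A to D, and the response formats accepted by the parser are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Assumption: options are labelled A-D and the model is prompted to reply
# with a bare letter, optionally wrapped as "(B)", "B.", or "Answer: B".
# Anything else is treated as unparseable rather than guessed at.
_CHOICE = re.compile(
    r"(?i:(?:the\s+)?answer\s*(?:is)?\s*[:\-]?\s*)?\(?([A-D])\)?\.?"
)


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter if the whole response is a bare choice, else None."""
    match = _CHOICE.fullmatch(response.strip())
    return match.group(1) if match else None


def accuracy(responses: Dict[str, str], answers: Dict[str, str]) -> float:
    """Fraction of questions whose parsed choice equals the gold letter.

    Missing or unparseable responses count as wrong.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(
        extract_choice(responses.get(qid, "")) == gold
        for qid, gold in answers.items()
    )
    return correct / len(answers)


# Hypothetical data: question IDs and responses are made up for illustration.
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
model_out = {"q1": "A", "q2": "(C)", "q3": "I am not sure", "q4": "Answer: B"}
print(accuracy(model_out, gold))
```

The parser is deliberately strict: a free-form reply such as "A dog is barking" is rejected instead of being read as option A, which keeps the score objective at the cost of penalizing models that ignore the answer format.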

Code Implementations: 1 repo
