MM | CV | SD | AS | Apr 3, 2025

Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness

Peking U
arXiv:2504.16936v1 | 2 citations | h-index: 30 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation of MLLMs on audio-visual tasks. The contribution is incremental: it focuses on assessment rather than on new methods.

The paper tackles the lack of comprehensive evaluation of audio-visual capabilities in multi-modal large language models (MLLMs) by assessing them along four dimensions: effectiveness, efficiency, generalizability, and robustness. It finds that MLLMs show strong zero-shot and few-shot generalization but rely heavily on visual input, so performance drops when that input is corrupted or missing. They are also susceptible to adversarial attacks, though more robust than traditional models.

Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.

