CV · AI · CL · Jun 12, 2024

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

arXiv:2406.08407v3 · 39 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

MMWorld provides a new benchmark for assessing world-model abilities in videos, addressing a gap in multimodal AI evaluation, though it is incremental in that it builds on existing video understanding benchmarks.

The paper tackles the problem of evaluating multimodal language models as world models by introducing MMWorld, a benchmark for multi-discipline, multi-faceted video understanding. Even the best-performing model evaluated, GPT-4V, achieves only 52.3% accuracy, indicating significant room for improvement.

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

Code Implementations: 1 repo
