CV | Jan 2

AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

arXiv:2601.00561v1 | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in benchmarking by giving AI researchers a more reliable diagnostic tool, though it is incremental in that it focuses on evaluation rather than model development.

The paper tackles the challenge of evaluating the world knowledge capabilities of Unified Multimodal Models by proposing AEGIS, a comprehensive multi-task benchmark with 1,050 questions across 21 topics. It finds that most models show severe world knowledge deficits and that performance degrades significantly on complex reasoning tasks.

The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, humanities, and daily life) and 6 reasoning types. To evaluate the world knowledge of UMMs concretely, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments to improve evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, pointing to a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
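The idea behind checklist-based evaluation can be illustrated with a minimal sketch. The abstract only states that DCE uses atomic "Y/N" judgments; the `ChecklistItem` type, the `checklist_score` function, the example questions, and the choice to aggregate as the fraction of "Y" verdicts are all assumptions made here for illustration, not the paper's actual protocol.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """Hypothetical container for one atomic yes/no check."""
    question: str  # an atomic yes/no question about the model output
    answer: str    # the judge's verdict, expected to be "Y" or "N"


def checklist_score(items):
    """Return the fraction of items judged "Y".

    Assumption: the abstract does not say how verdicts are aggregated,
    so a simple mean over binary verdicts is used here.
    """
    if not items:
        raise ValueError("checklist must contain at least one item")
    verdicts = []
    for item in items:
        verdict = item.answer.strip().upper()
        if verdict not in {"Y", "N"}:
            # Reject anything that is not a binary judgment.
            raise ValueError(f"non-binary verdict: {item.answer!r}")
        verdicts.append(verdict == "Y")
    return sum(verdicts) / len(verdicts)


# Illustrative checklist for a single generated image (made-up questions).
example = [
    ChecklistItem("Does the image show the Eiffel Tower?", "Y"),
    ChecklistItem("Is the tower depicted at night?", "N"),
    ChecklistItem("Is the requested caption text present?", "Y"),
    ChecklistItem("Is the caption spelled correctly?", "Y"),
]
print(checklist_score(example))  # 0.75
```

Because each item admits only "Y" or "N", the resulting score depends on a set of narrow, independently checkable judgments rather than on a single holistic rating from a prompted judge.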

