CV · Mar 5

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, Wei Ji, Lei Cui

arXiv:2603.05075v1 · 1.53 citations · h-index: 5

Originality: Highly original

AI Analysis

This work addresses the need for a unified benchmark that evaluates MLLMs on their 'any-to-any' interleaved multimodal capabilities, an open challenge for the MLLM research community.

This paper introduces UniM, a new benchmark for evaluating multimodal large language models (MLLMs) on their ability to process and generate arbitrarily combined and interleaved multimodal inputs and outputs. The benchmark comprises 31K high-quality instances across 30 domains and 7 modalities. It is accompanied by an evaluation suite that assesses three dimensions: semantic correctness and generation quality, response structure integrity, and interleaved coherence.

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
