93.2 | DB | May 29
FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data
Ziting Wang, Shize Zhang, Haitao Yuan et al.
The growing demand for data-driven decision-making has created an urgent need for data agents that can reason over heterogeneous data (databases, documents, web content, images, videos, and audio) to answer complex analytical queries. However, evaluating such agents remains challenging: existing benchmarks often focus on isolated agent capabilities or limited data modalities, and lack both comprehensive coverage of heterogeneous data and rigorous evaluation across diverse data agent architectures. To address these challenges, we present FDABench, a benchmark for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios. Our contributions are threefold: (1) a comprehensive benchmark of 2,007 tasks spanning six data modalities with a unified, multi-granularity evaluation framework; (2) PUDDING, an agentic dataset construction framework that leverages LLM generation with iterative expert validation for reliable and scalable benchmark construction; and (3) extensive experiments across diverse data agent architectures, including general analytical agents, semantic operator frameworks, and RAG-based methods, which reveal key insights and guidelines for future data agent development. Our data and source code are released at https://github.com/fdabench/FDAbench.
CV | Feb 1
Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning
Meng Luo, Bobo Li, Shanqing Xu et al.
Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we first introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. Overall, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.
CV | Mar 5
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Yanlin Li, Minghui Guo, Kaiwen Zhang et al.
In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.