64.0CVMay 29
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual EvidenceYuhan Wang, Shuochen Chang, Yalin Feng et al.
Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
84.8CLMay 13Code
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document IntelligenceDongsheng Ma, Jiayu Li, Zhengren Wang et al.
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline-which identifies crucial evidence via masking ablation-and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
CLJun 29, 2025Code
Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL QueriesZhengren Wang, Dongwen Yao, Bozhou Li et al.
The proliferation of unstructured data poses a fundamental challenge to traditional database interfaces. While Text-to-SQL has democratized access to structured data, it remains incapable of interpreting semantic or multi-modal queries. Concurrently, vector search has emerged as the de facto standard for querying unstructured data, but its integration with SQL-termed VectorSQL-still relies on manual query crafting and lacks standardized evaluation methodologies, creating a significant gap between its potential and practical application. To bridge this fundamental gap, we introduce and formalize Text2VectorSQL, a novel task to establish a unified natural language interface for seamlessly querying both structured and unstructured data. To catalyze research in this new domain, we present a comprehensive foundational ecosystem, including: (1) A scalable and robust pipeline for synthesizing high-quality Text-to-VectorSQL training data. (2) VectorSQLBench, the first large-scale, multi-faceted benchmark for this task, encompassing 12 distinct combinations across three database backends (SQLite, PostgreSQL, ClickHouse) and four data sources (BIRD, Spider, arXiv, Wikipedia). (3) Several novel evaluation metrics designed for more nuanced performance analysis. Extensive experiments not only confirm strong baseline performance with our trained models, but also reveal the recall degradation challenge: the integration of SQL filters with vector search can lead to more pronounced result omissions than in conventional filtered vector search. By defining the core task, delivering the essential data and evaluation infrastructure, and identifying key research challenges, our work lays the essential groundwork to build the next generation of unified and intelligent data interfaces. Our repository is available at https://github.com/OpenDCAI/Text2VectorSQL.
99.1CVApr 6
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at ScaleBin Wang, Tianyao He, Linke Ouyang et al.
Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present \minerupro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of \mineru completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench~v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench~v1.6 protocol. Without any architectural modification, \minerupro achieves 95.69 on OmniDocBench~v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200$\times$ more parameters.
CLMar 30, 2025
RARE: Retrieval-Augmented Reasoning ModelingZhengren Wang, Jiayang Yu, Dongsheng Ma et al.
Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom's Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts with masked losses, RARE transforms learning objectives from rote memorization to contextualized reasoning. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Extensive experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 up to approximately 20\% accuracy. RARE establishes a paradigm shift where maintainable external knowledge bases synergize with compact, reasoning-optimized models, collectively driving more scalable domain-specific intelligence.
CVSep 26, 2025
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document ParsingJunbo Niu, Zheng Liu, Zhuangcheng Gu et al.
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
SEApr 30, 2025
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code GenerationSizhe Wang, Zhengren Wang, Dongsheng Ma et al.
Modern software development demands code that is maintainable, testable, and scalable by organizing the implementation into modular components with iterative reuse of existing codes. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow, namely implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises 5,258 problems from Codeforces and is continuously updated via an automated pipeline, which decomposes each problem into subproblems with unit tests based on dependency tree analysis and dataflow analysis. We further propose a novel evaluation framework featured dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments on 16 popular LLMs reveal significant performance degradation in multi-turn scenarios. For instance, o1-mini retains only 20.8% Pass@1 in multi-turn scenario versus 37.8% in single-turn scenario. More fine-grained analysis illustrates that model performance inversely correlates with dependency complexity. These findings not only highlight the critical challenges for supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research.