CL, DB | Feb 26, 2025

MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, Nan Tang

arXiv:2502.18993v3 | 15 citations | h-index: 3 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical bottleneck in AI for tasks that require integrating entity-dense information from multiple documents, though its contribution is incremental: it benchmarks the problem rather than solving it.

The paper tackles cross-document multi-entity question answering (MEQA), where large language models struggle to consolidate scattered information, and introduces MEBench, a benchmark of 4,780 questions on which even advanced models such as GPT-4 achieve only 59% accuracy.

Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often falter at cross-document aggregation, particularly when resolving entity-dense questions like "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions, organized into three primary categories and further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
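
The abstract names EA-F1 but does not define it here. As a rough illustration of what an entity-level F1 can look like, the sketch below scores a predicted answer against a gold answer by treating each as a set of (entity, attribute, value) triples and computing set-based precision, recall, and F1. The triple representation, the exact-match rule, and the function name `entity_attribute_f1` are assumptions made for this example; the paper's actual EA-F1 definition may differ.

```python
from typing import Iterable, Tuple

# Hypothetical representation: one fact = (entity, attribute, value).
Triple = Tuple[str, str, str]


def entity_attribute_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Set-based F1 over exact-match (entity, attribute, value) triples.

    This is an illustrative stand-in, not the paper's EA-F1 definition.
    """
    pred_set = set(predicted)
    gold_set = set(gold)

    # Convention chosen for this sketch: two empty answers agree perfectly.
    if not pred_set and not gold_set:
        return 1.0

    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example data, loosely following the ACM Fellows question.
gold_facts = [
    ("Alice Example", "field", "Databases"),
    ("Bob Example", "field", "Machine Learning"),
    ("Carol Example", "field", "Networks"),
]
predicted_facts = [
    ("Alice Example", "field", "Databases"),
    ("Bob Example", "field", "Machine Learning"),
    ("Carol Example", "field", "Security"),  # wrong value for this entity
]

score = entity_attribute_f1(predicted_facts, gold_facts)
```

A score of this kind penalizes both missing entities (lower recall) and wrongly attributed values (lower precision), which is the sort of completeness and factual-precision trade-off the abstract says the benchmark emphasizes.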
