CL AI Nov 5, 2025

ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation

Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng

arXiv:2511.03656v1 2.7 ICANN

Originality: Synthesis-oriented

AI Analysis

This work provides a comprehensive dataset for Chinese NLP tasks and addresses a specific need in the field, though it is incremental, building on existing QA dataset efforts.

The authors tackled the lack of high-quality Chinese document question-answering datasets by creating ChiMDQA, a dataset with 6,068 QA pairs across six domains and ten fine-grained categories, designed for downstream business applications.

With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To meet this demand, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), specifically designed for downstream business scenarios across prevalent domains including academic, education, finance, law, medical treatment, and news. ChiMDQA encompasses long-form documents from these six fields and consists of 6,068 rigorously curated, high-quality question-answer (QA) pairs, further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset ensures both diversity and high quality, making it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset's design objectives, construction methodologies, and fine-grained evaluation system, providing a substantial foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.
