CL · Mar 15

Automatic Inter-document Multi-hop Scientific QA Generation

Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song

arXiv:2603.14257 · 79.72 citations · h-index: 2

Predicted impact: top 64% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the need for realistic benchmarks for retrieval-augmented scientific reasoning, which matters particularly to researchers and developers in AI and NLP. It is an incremental advance, since it builds on existing QA generation methods.

Existing automatic scientific question generation methods focus on single-document factoid QA. To address this limitation, the researchers developed AIM-SciQA, a framework for generating multi-document, multi-hop scientific QA datasets. The result was the IM-SciQA dataset, with 411,409 single-hop and 13,672 multi-hop QAs from 8,211 PubMed Central papers, validated for factual consistency and for its effectiveness in differentiating reasoning capabilities.

Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
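
The cross-document linking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `link_cross_document`, `threshold`, and `citations`, along with the sample QAs, are hypothetical. The sketch pairs a single-hop QA from one paper with a QA from another paper when the first answer is semantically close to the second question, and can optionally restrict pairs to papers connected by a citation.

```python
import math
import re
from collections import Counter
from itertools import permutations


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def link_cross_document(qas, threshold=0.3, citations=None):
    """Return (first_id, second_id, score) for QA pairs from different papers
    where the first answer aligns with the second question.

    `citations` is an optional set of (citing_doc, cited_doc) pairs; when
    given, only pairs whose documents appear in it are considered.
    """
    links = []
    for first, second in permutations(qas, 2):
        if first["doc"] == second["doc"]:
            continue  # only inter-document links are of interest
        if citations is not None and (first["doc"], second["doc"]) not in citations:
            continue  # hypothetical citation filter
        score = cosine(embed(first["answer"]), embed(second["question"]))
        if score >= threshold:
            links.append((first["id"], second["id"], round(score, 3)))
    return links


# Hypothetical single-hop QAs, invented for illustration only.
qas = [
    {"id": "q1", "doc": "paperA",
     "question": "Which protein does drug X inhibit?", "answer": "kinase Y"},
    {"id": "q2", "doc": "paperB",
     "question": "What pathway is regulated by kinase Y?",
     "answer": "the Z signaling pathway"},
    {"id": "q3", "doc": "paperA",
     "question": "What is kinase Y?", "answer": "an enzyme"},
]

print(link_cross_document(qas))
```

In this toy data, the answer to `q1` ("kinase Y") overlaps with the question of `q2`, so the two could be chained into a two-hop question spanning both papers. A real pipeline would presumably swap `embed` for a dense embedding model and tune `threshold` empirically.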
