CL | May 18, 2025

KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

Georgia Tech
arXiv:2505.12495v12 citations | h-index: 29 | Has Code
Originality: Highly original
AI Analysis

This paper addresses the problem of systematically assessing long-context LLM capabilities, a concern for researchers and developers. It is incremental, however, in that it builds on existing benchmarks and adds a structured approach to them.

The authors address the need to evaluate long-context language models by introducing KG-QAGen, a framework that generates QA pairs at multiple complexity levels from financial agreements. The resulting dataset contains 20,139 pairs, and the evaluation shows that top models struggle with set-based comparisons and multi-hop inference.

The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions (multi-hop retrieval, set operations, and answer plurality), enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models struggle with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and an inability to handle implicit relations.
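
The three complexity dimensions named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the triples, the relation names (`has_lender`, `located_in`), the `follow` helper, and the `QAPair` fields are all hypothetical, chosen only to show how hop count, set operations, and answer plurality could each be varied independently over a toy knowledge graph of financial agreements.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge graph as (subject, relation, object) triples.
# None of these entities or relations come from the paper's dataset.
TRIPLES = [
    ("Agreement_A", "has_lender", "Bank_X"),
    ("Agreement_A", "has_lender", "Bank_Y"),
    ("Agreement_B", "has_lender", "Bank_Y"),
    ("Agreement_B", "has_lender", "Bank_Z"),
    ("Bank_X", "located_in", "New York"),
    ("Bank_Y", "located_in", "London"),
    ("Bank_Z", "located_in", "New York"),
]


def follow(entities, relation):
    """One hop: all objects reachable from `entities` via `relation`."""
    return {o for s, r, o in TRIPLES if s in entities and r == relation}


@dataclass
class QAPair:
    question: str
    answers: set
    hops: int      # number of relations traversed (multi-hop retrieval)
    set_ops: int   # number of set operations applied (e.g. intersection)

    @property
    def plural(self):
        """Answer plurality: True when more than one answer is expected."""
        return len(self.answers) > 1


def make_examples():
    lenders_a = follow({"Agreement_A"}, "has_lender")
    lenders_b = follow({"Agreement_B"}, "has_lender")
    return [
        # One hop, no set operation, plural answer.
        QAPair("Who are the lenders in Agreement A?", lenders_a, 1, 0),
        # Two hops: agreement -> lenders -> locations.
        QAPair("Where are the lenders in Agreement A located?",
               follow(lenders_a, "located_in"), 2, 0),
        # One hop per agreement plus one set intersection, single answer.
        QAPair("Which lenders appear in both Agreement A and Agreement B?",
               lenders_a & lenders_b, 1, 1),
    ]


if __name__ == "__main__":
    for qa in make_examples():
        print(qa.hops, qa.set_ops, qa.plural, qa.question, sorted(qa.answers))
```

Because each question is derived from graph traversals, its difficulty labels are recorded at generation time instead of being estimated afterwards, which is the kind of controlled variation the abstract describes.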

Code Implementations: 1 repo