CL · Oct 18, 2023

Systematic Assessment of Factual Knowledge in Large Language Models

arXiv:2310.11638v3 · 138 citations · h-index: 44
Originality: Incremental advance
AI Analysis

This work addresses the need for a more systematic and comprehensive evaluation of factual knowledge in LLMs. The advance is incremental: it builds on existing evaluation methods and uses knowledge graphs to broaden coverage.

The paper tackles the problem of evaluating factual knowledge in large language models (LLMs). It proposes a framework that uses knowledge graphs to generate questions and assess answer accuracy. ChatGPT consistently performs best across domains, and performance is influenced by instruction finetuning, domain, question complexity, and adversarial context.

Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate state-of-the-art LLMs with KGs in generic and specific domains. The experiments show that ChatGPT is consistently the top performer across all domains. We also find that LLM performance depends on instruction finetuning, domain, and question complexity, and that LLMs are prone to adversarial context.
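The pipeline the abstract describes (facts in a KG, then generated questions with expected answers, then an accuracy score) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the triple format, the `TEMPLATES` table, the toy facts, the `toy_model` stand-in for an LLM, and the exact-match scoring rule. The paper's own question generation and answer matching may differ.

```python
from typing import Callable, Dict, List, Tuple

# A KG fact as a (subject, relation, object) triple. Assumed format.
Triple = Tuple[str, str, str]

# Hypothetical question templates, one per relation (not from the paper).
TEMPLATES: Dict[str, str] = {
    "has_capital": "What is the capital of {subject}?",
    "born_in": "Where was {subject} born?",
}


def generate_questions(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Turn each triple into a (question, expected_answer) pair."""
    pairs = []
    for subject, relation, obj in triples:
        template = TEMPLATES.get(relation)
        if template is None:
            continue  # skip relations that have no template
        pairs.append((template.format(subject=subject), obj))
    return pairs


def accuracy(pairs: List[Tuple[str, str]],
             answer_fn: Callable[[str], str]) -> float:
    """Fraction of questions answered correctly (case-insensitive exact match)."""
    if not pairs:
        return 0.0
    correct = sum(
        1 for question, expected in pairs
        if answer_fn(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(pairs)


# Toy facts and a toy "model" standing in for an LLM call (both hypothetical).
TOY_KG: List[Triple] = [
    ("France", "has_capital", "Paris"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Japan", "has_capital", "Tokyo"),
]


def toy_model(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris",
        "Where was Marie Curie born?": "Warsaw",
        "What is the capital of Japan?": "Kyoto",  # deliberately wrong
    }
    return canned.get(question, "unknown")


if __name__ == "__main__":
    qa_pairs = generate_questions(TOY_KG)
    print(f"accuracy: {accuracy(qa_pairs, toy_model):.2f}")
```

In practice `toy_model` would be replaced by a call to the LLM under test, and exact match would likely be too strict for free-form answers; both choices are kept simple here to show the shape of the evaluation loop.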

