CL · AI · LG · Aug 25, 2023

Rethinking Language Models as Symbolic Knowledge Graphs

CMU
arXiv:2308.13676v1 · 16 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses a gap in evaluating language models for knowledge graph reasoning, which is important for applications like search and question answering, but it is incremental as it builds on existing evaluation methods.

The paper evaluates whether language models can capture the complex topological and semantic attributes of symbolic knowledge graphs. It finds that while models show potential for factual recall, their ability to handle these attributes remains significantly constrained. On some benchmarks, larger models such as GPT-4 do not outperform smaller ones such as BERT.

Symbolic knowledge graphs (KGs) play a pivotal role in knowledge-centric applications such as search, question answering and recommendation. As contemporary language models (LMs) trained on extensive textual data have gained prominence, researchers have extensively explored whether the parametric knowledge within these models can match up to that present in knowledge graphs. Various methodologies have indicated that enhancing the size of the model or the volume of training data enhances its capacity to retrieve symbolic knowledge, often with minimal or no human supervision. Despite these advancements, there is a void in comprehensively evaluating whether LMs can encompass the intricate topological and semantic attributes of KGs, attributes crucial for reasoning processes. In this work, we provide an exhaustive evaluation of language models of varying sizes and capabilities. We construct nine qualitative benchmarks that encompass a spectrum of attributes including symmetry, asymmetry, hierarchy, bidirectionality, compositionality, paths, entity-centricity, bias and ambiguity. Additionally, we propose novel evaluation metrics tailored for each of these attributes. Our extensive evaluation of various LMs shows that while these models exhibit considerable potential in recalling factual information, their ability to capture intricate topological and semantic traits of KGs remains significantly constrained. We note that our proposed evaluation metrics are more reliable in evaluating these abilities than the existing metrics. Lastly, some of our benchmarks challenge the common notion that larger LMs (e.g., GPT-4) universally outshine their smaller counterparts (e.g., BERT).
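To make one of the listed attributes concrete, here is a minimal sketch of a symmetry check over triples extracted from a model. It is illustrative only and is not the paper's metric, which the abstract does not define. The function name `symmetry_consistency`, the triple format, and the toy relation names are all assumptions introduced for this example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical triple format: (head, relation, tail).
Triple = Tuple[str, str, str]


def symmetry_consistency(
    predicted: Iterable[Triple], symmetric_relations: Set[str]
) -> float:
    """Fraction of predicted symmetric-relation triples whose reverse
    triple (tail, relation, head) is also predicted.

    This is an illustrative score, not a metric taken from the paper.
    """
    triples = set(predicted)
    relevant = [t for t in triples if t[1] in symmetric_relations]
    if not relevant:
        # No symmetric-relation triples to check: treat as vacuously consistent.
        return 1.0
    hits = sum((tail, rel, head) in triples for (head, rel, tail) in relevant)
    return hits / len(relevant)


# Toy predictions, invented for illustration.
predicted = {
    ("Alice", "spouse_of", "Bob"),
    ("Bob", "spouse_of", "Alice"),
    ("Carol", "spouse_of", "Dave"),  # reverse triple is missing
    ("Paris", "capital_of", "France"),  # not a symmetric relation, ignored
}

score = symmetry_consistency(predicted, {"spouse_of"})
print(f"symmetry consistency: {score:.2f}")
```

A model could recall each fact in isolation and still score below 1.0 here, which is the kind of gap between factual recall and structural consistency the abstract describes.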

