LG AI CL SI | Sep 28, 2025

Knowledge Homophily in Large Language Models

Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang

arXiv:2509.23773v14.1 | h-index: 24 | WSDM

Originality: Incremental advance

AI Analysis

This work targets inefficient knowledge extraction and utilization in LLMs. It is aimed at AI researchers and practitioners and offers incremental improvements for knowledge-intensive applications.

The study investigates how knowledge is structurally organized in Large Language Models (LLMs) and finds a knowledge homophily pattern: entities that are closer in a knowledge graph tend to have similar knowledgeability levels. Building on this pattern, the authors propose a Graph Neural Network (GNN) regression model that improves knowledge coverage and efficiency in tasks such as active labeling and question answering.

Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
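The pipeline described in the abstract (check triplets, aggregate to entity-level knowledgeability, compare each entity with its neighbors, then prioritize the least-known items) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the exact knowledgeability or homophily definitions, so the sketch uses the fraction of known triplets per entity and one minus the mean absolute gap to the neighbor average. A plain neighbor-mean estimator stands in for the paper's GNN regression model, and all function names and toy data are hypothetical.

```python
from collections import defaultdict


def entity_knowledgeability(checked_triplets):
    """Fraction of an entity's checked triplets the LLM got right.

    checked_triplets: iterable of (head, relation, tail, is_known).
    This aggregation rule is an assumption, not the paper's definition.
    """
    known, total = defaultdict(int), defaultdict(int)
    for head, _relation, tail, is_known in checked_triplets:
        for entity in (head, tail):
            total[entity] += 1
            known[entity] += int(is_known)
    return {entity: known[entity] / total[entity] for entity in total}


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) knowledge-graph edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def neighbor_mean(entity, adjacency, scores):
    """Mean score over the entity's scored neighbors, or None if there are none."""
    values = [scores[n] for n in adjacency.get(entity, ()) if n in scores]
    return sum(values) / len(values) if values else None


def homophily(adjacency, scores):
    """Assumed homophily measure: 1 - mean |score - neighbor mean|."""
    gaps = []
    for entity, score in scores.items():
        mean = neighbor_mean(entity, adjacency, scores)
        if mean is not None:
            gaps.append(abs(score - mean))
    return 1.0 - sum(gaps) / len(gaps) if gaps else None


def prioritize(unlabeled, adjacency, scores, default=0.5):
    """Rank unlabeled entities from least to most likely known.

    Uses the neighbor mean as a simple stand-in for a GNN regressor.
    """
    estimates = {}
    for entity in unlabeled:
        mean = neighbor_mean(entity, adjacency, scores)
        estimates[entity] = default if mean is None else mean
    return sorted(unlabeled, key=lambda e: estimates[e]), estimates


# Toy data (hypothetical): four checked triplets and a small graph.
checked = [
    ("A", "r", "B", True),
    ("A", "r", "C", True),
    ("B", "r", "C", False),
    ("D", "r", "E", False),
]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("F", "A"), ("G", "D")]

scores = entity_knowledgeability(checked)
adjacency = build_adjacency(edges)
graph_homophily = homophily(adjacency, scores)
ranking, estimates = prioritize(["F", "G"], adjacency, scores)
```

In this toy graph, the unlabeled entity `G` sits next to a poorly known entity and is therefore ranked ahead of `F` for checking, which mirrors the idea of spending a fixed labeling budget on the triplets the model is least likely to know.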
