LG · May 27

Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs

Safal Thapaliya, Jiatan Huang, Chuxu Zhang

arXiv:2605.27913 · 72.4 · h-index: 4

Predicted impact: top 20% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For graph learning practitioners who rely on LLMs for cheap labeling, this work provides a method to handle the nuanced noise patterns in LLM annotations, improving label-free node classification.

The paper identifies that LLM-generated labels for node classification suffer from cluster-conditional noise, not just global or class-conditional noise. The proposed CANE framework estimates cluster-conditional reliability without ground truth and improves over label-free baselines, with gains of up to 5% on datasets with strong cluster-conditional noise.

Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels and uses this estimate to decide which pseudo-labels to trust and which to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.
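
To make the distinction between class-conditional and cluster-conditional noise concrete, here is a minimal Python sketch on hypothetical toy data. It is not CANE's estimator, which works without ground truth; true labels are used here only to make the noise pattern visible. The helper name `accuracy_by_group`, the cluster names, and all label values are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(true_labels, llm_labels, groups):
    """Return {group_key: fraction of nodes whose LLM label matches the true label}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for y, y_llm, g in zip(true_labels, llm_labels, groups):
        totals[g] += 1
        hits[g] += int(y == y_llm)
    return {g: hits[g] / totals[g] for g in totals}


# Hypothetical toy data: 8 nodes of the same true class (0),
# split across two feature-space clusters "a" and "b".
true_labels = [0, 0, 0, 0, 0, 0, 0, 0]
cluster_ids = ["a", "a", "a", "a", "b", "b", "b", "b"]
# The simulated annotator is reliable in cluster "a" and unreliable in "b".
llm_labels = [0, 0, 0, 0, 1, 1, 0, 1]

# Class-conditional view: one reliability number per class.
per_class = accuracy_by_group(true_labels, llm_labels, true_labels)

# Cluster-conditional view: one number per (class, cluster) pair.
per_class_cluster = accuracy_by_group(
    true_labels, llm_labels, list(zip(true_labels, cluster_ids))
)

print(per_class)
print(per_class_cluster)
```

In this toy setup the single per-class figure averages over two very different regions, so a method that models only class-conditional noise would trust the labels in both clusters equally. Splitting by cluster exposes the gap that the abstract describes as region-dependent reliability.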
