ALINC: Active Learning for Inductive Node Classification via Graph Sampling
For researchers in domains with many independent graphs (e.g., molecular chemistry, electronic design automation), ALINC provides the first active learning approach tailored to inductive node classification, enabling efficient annotation by selecting whole graphs.
ALINC is an active learning framework for inductive node classification that selects entire graphs rather than individual nodes, addressing a gap in scenarios where annotating a node requires a full-graph analysis. In a benchmark of ten strategies, three aggregation methods, and four datasets, CoreSet, TypiClust, and BADGE perform best, and the choice of aggregation method substantially affects both model performance and annotation costs.
Active learning (AL) for node classification typically focuses on selecting the most informative nodes for annotation within one or a few large graphs (e.g., in social network analysis). In other domains, however, such as molecular chemistry or electronic design automation, datasets consist of thousands of independent graphs. In many of these inductive settings, annotating an individual node requires a full-graph analysis, which effectively yields the labels of the remaining nodes as a by-product. These scenarios therefore require AL strategies that select entire graphs instead of single nodes, a problem that has not yet been tackled in the literature. We introduce ALINC, an AL framework for inductive node classification via graph sampling. It bridges this methodological gap by lifting node-level utility measures to graph-level selection criteria through various aggregation mechanisms. In an extensive benchmark covering ten strategies, three aggregation methods, and four datasets, we identify CoreSet, TypiClust, and BADGE as the top-performing graph sampling strategies. Our detailed analysis further reveals that the choice of aggregation method is pivotal, as it substantially affects both model performance and annotation costs. Finally, we demonstrate the effectiveness of ALINC in two use case studies: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.
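To make the idea of lifting node-level utilities to graph-level selection criteria concrete, the following is a minimal sketch and not the ALINC implementation. It assumes predictive entropy as the node-level utility and sum, mean, and max as the aggregation mechanisms; the abstract names neither, so both choices, along with the function names `node_entropy` and `select_graphs`, are illustrative assumptions.

```python
import numpy as np

# Hypothetical aggregators; the abstract does not name the three it benchmarks.
AGGREGATORS = {"sum": np.sum, "mean": np.mean, "max": np.max}


def node_entropy(probs):
    """Per-node predictive entropy for class probabilities of shape (n_nodes, n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def select_graphs(graph_probs, budget, agg="mean"):
    """Rank unlabeled graphs by aggregated node utility and return the top `budget`.

    graph_probs: list with one (n_nodes, n_classes) array per unlabeled graph.
    Returns (selected graph indices, per-graph scores).
    """
    aggregate = AGGREGATORS[agg]
    scores = np.array([aggregate(node_entropy(p)) for p in graph_probs])
    # Stable sort on negated scores: highest score first, ties broken by index.
    order = np.argsort(-scores, kind="stable")
    return order[:budget].tolist(), scores


# Toy pool (assumed data): graphs with 2, 5, and 1 nodes, two classes each.
pool = [
    np.tile([0.99, 0.01], (2, 1)),  # small graph, confident predictions
    np.tile([0.70, 0.30], (5, 1)),  # larger graph, moderately uncertain
    np.tile([0.50, 0.50], (1, 1)),  # single node, maximally uncertain
]

picked_mean, _ = select_graphs(pool, budget=1, agg="mean")
picked_sum, _ = select_graphs(pool, budget=1, agg="sum")
```

On this toy pool, mean aggregation selects the single-node graph, whereas sum aggregation selects the five-node graph. This illustrates, under the stated assumptions, why the aggregation choice can shift both which graphs are annotated and how many nodes each query costs.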