LG · May 20, 2023

Semi-Supervised Graph Imbalanced Regression

Gang Liu, Tong Zhao, Eric Inae, Tengfei Luo, Meng Jiang

arXiv:2305.12087v1 · 16.0 · 27 citations · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses expensive labeling and data scarcity in graph regression for domains like chemistry. The advance is incremental, since it builds on existing semi-supervised and data-balancing techniques.

The paper tackles data imbalance in graph regression tasks, particularly for molecule and polymer property predictions, by proposing a semi-supervised framework that uses pseudo-labeling and latent space augmentation to balance training data, resulting in significantly reduced prediction errors, especially for under-represented labels.

Data imbalance arises readily in annotated data when observations of certain continuous label values are difficult to collect for regression tasks. When it comes to molecule and polymer property prediction, the annotated graph datasets are often small because labeling them requires expensive equipment and effort. To address the lack of examples of rare label values in graph regression tasks, we propose a semi-supervised framework to progressively balance training data and reduce model bias via self-training. The training data balance is achieved by (1) pseudo-labeling more graphs for under-represented labels with a novel regression confidence measurement and (2) augmenting graph examples in latent space for the labels that remain rare after data balancing with pseudo-labels. The former identifies quality examples from unlabeled data whose labels are confidently predicted and samples a subset of them with a reverse distribution from the imbalanced annotated data. The latter collaborates with the former to target a perfect balance using a novel label-anchored mixup algorithm. We perform experiments on seven regression tasks on graph datasets. Results demonstrate that the proposed framework significantly reduces the error of predicted graph properties, especially in under-represented label areas.
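The reverse-distribution sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The equal-width binning of labels, the inverse-count weighting, the fixed confidence threshold, and every name below (`reverse_distribution_sample`, `n_bins`, `min_confidence`) are assumptions made for illustration. The paper's own regression confidence measurement is not reproduced here, so confidence scores are taken as a given input.

```python
import numpy as np

def reverse_distribution_sample(labeled_y, pseudo_y, confidence, n_bins=10,
                                min_confidence=0.5, n_samples=100, seed=0):
    """Pick confident pseudo-labeled examples, favouring rare label intervals.

    Hypothetical helper: returns indices into pseudo_y.
    """
    rng = np.random.default_rng(seed)
    labeled_y = np.asarray(labeled_y, dtype=float)
    pseudo_y = np.asarray(pseudo_y, dtype=float)
    confidence = np.asarray(confidence, dtype=float)

    # Assumption: discretise the continuous label range into equal-width bins.
    edges = np.linspace(labeled_y.min(), labeled_y.max(), n_bins + 1)
    # Using interior edges only maps every value to a bin index in 0..n_bins-1;
    # pseudo-labels outside the labeled range fall into the first or last bin.
    labeled_bins = np.digitize(labeled_y, edges[1:-1])
    counts = np.bincount(labeled_bins, minlength=n_bins)

    # Keep only pseudo-labels whose confidence passes the (assumed) threshold.
    candidates = np.flatnonzero(confidence >= min_confidence)
    if candidates.size == 0:
        return candidates
    cand_bins = np.digitize(pseudo_y[candidates], edges[1:-1])

    # Reverse distribution: the fewer labeled examples a bin has,
    # the more likely a candidate in that bin is to be drawn.
    weights = 1.0 / (counts[cand_bins] + 1.0)
    probs = weights / weights.sum()
    k = min(n_samples, candidates.size)
    return rng.choice(candidates, size=k, replace=False, p=probs)
```

In a self-training loop, the returned indices would select the pseudo-labeled graphs added to the training set before the next round. The sketch samples by inverse frequency only and does not attempt the label-anchored mixup step.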
