SYNAPSE-G: Bridging Large Language Models and Graph Learning for Rare Event Classification
This work addresses the cold-start problem in rare event classification. The contribution is incremental, since it builds on existing LLM and graph learning techniques.
The paper tackles the scarcity of labeled data for rare event classification by proposing SYNAPSE-G, a pipeline that uses Large Language Models to generate synthetic data and semi-supervised graph learning to expand the dataset. Experiments on the imbalanced SST2 and MHS datasets show improved performance over baselines.
Scarcity of labeled data, especially for rare events, hinders training effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline leveraging Large Language Models (LLMs) to generate synthetic training data for rare event classification, addressing the cold-start problem. The synthetic data serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. The propagation identifies candidate positive examples, which are subsequently labeled by an oracle (human or LLM). The expanded dataset is then used to train or fine-tune a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G's effectiveness in finding positive labels, outperforming baselines including nearest neighbor search.
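The seed-expansion step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the cosine k-nearest-neighbor graph, the damped propagation rule, the parameter values (k, alpha, iters), the toy 2-D "embeddings", and all function names are assumptions chosen for illustration. The abstract does not specify how the similarity graph is built or which propagation variant is used.

```python
import numpy as np

def cosine_knn_graph(x, k):
    """Symmetric k-nearest-neighbor graph weighted by cosine similarity (assumed graph type)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)               # a node is never its own neighbor
    n = x.shape[0]
    nearest = np.argsort(-sim, axis=1)[:, :k]    # k most similar nodes per row
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    adj = np.zeros((n, n))
    adj[rows, cols] = np.clip(sim[rows, cols], 0.0, None)  # drop negative weights
    return np.maximum(adj, adj.T)                # make the graph undirected

def propagate(adj, seed_idx, alpha=0.85, iters=50):
    """Damped label propagation: seeds hold label 1 and scores diffuse along edges."""
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0                          # isolated nodes keep a zero score
    trans = adj / deg[:, None]                   # row-normalized transition matrix
    seed = np.zeros(adj.shape[0])
    seed[seed_idx] = 1.0
    score = seed.copy()
    for _ in range(iters):
        score = alpha * (trans @ score) + (1 - alpha) * seed
    return score

def rank_candidates(seed_emb, pool_emb, k=4, top_n=5):
    """Return indices into the unlabeled pool, best candidates first."""
    x = np.vstack([seed_emb, pool_emb])
    n_seed = seed_emb.shape[0]
    score = propagate(cosine_knn_graph(x, k), np.arange(n_seed))
    pool_score = score[n_seed:]
    return np.argsort(-pool_score)[:top_n], pool_score

# Toy data (hypothetical): rare positives point along (1, 0), negatives along (0, 1).
rng = np.random.default_rng(0)
seeds = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((3, 2))       # stand-in for LLM-generated seeds
positives = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((5, 2))   # rare class in the pool
negatives = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((45, 2))  # majority class
pool = np.vstack([positives, negatives])        # pool indices 0..4 are the true positives

candidates, pool_score = rank_candidates(seeds, pool, k=4, top_n=5)
print(sorted(candidates.tolist()))
```

In a full pipeline the returned candidates would go to the oracle (human or LLM) for labeling, and the confirmed positives would be added to the training set. Unlike a plain nearest-neighbor search from the seeds, propagation can reach pool items that are several hops away from any seed in the graph.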