LG CL Oct 28, 2024

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

arXiv:2410.21526v2 11.56 citations h-index: 2 ICLR

Originality: Incremental advance

AI Analysis

This work addresses the challenge of effectively leveraging synthetic data for model training in text classification, particularly when real data is scarce, though it is incremental as it builds on existing data weighting techniques.

The paper tackles the problem that synthetic data from large language models can deviate from real-world distributions and harm model performance. It proposes weighted-loss approaches that align synthetic data with the real-world distribution; on text classification tasks with BERT-level models, these approaches robustly outperform standard cross-entropy and other data weighting methods.

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can lead to poor outcomes when the trained model is applied in practice. Therefore, we proposed efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed that applying our approaches to a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, offering potential solutions for effectively leveraging synthetic data from any suitable data generator for model training.
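
To make the idea of a weighted loss concrete, here is a minimal sketch of per-example weighted cross-entropy in plain Python. It is not the paper's weighting scheme: the abstract does not state how the weights are computed, so the `weights` argument is a hypothetical placeholder for whatever quality or diversity score is assigned to each synthetic example. The function names are also illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one example's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Average cross-entropy over a batch, scaling each example by its weight.

    `weights` is an assumption for illustration: one non-negative score per
    example. With every weight equal to 1.0 this reduces to the standard
    (unweighted) mean cross-entropy.
    """
    if not (len(batch_logits) == len(labels) == len(weights)):
        raise ValueError("batch_logits, labels and weights must have equal length")
    total = 0.0
    for logits, label, weight in zip(batch_logits, labels, weights):
        # Per-example loss is the negative log-probability of the true class.
        total += weight * -log_softmax(logits)[label]
    return total / len(labels)


# Two synthetic examples, both labeled class 0; the second is predicted wrongly.
logits = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 0]
uniform = weighted_cross_entropy(logits, labels, [1.0, 1.0])
downweighted = weighted_cross_entropy(logits, labels, [1.0, 0.0])
```

In this toy batch, setting the second weight to 0.0 removes that example's contribution entirely, so `downweighted` is smaller than `uniform`. A real training setup would compute the same quantity on tensors and backpropagate through it.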
