CL AI · Jun 27, 2024

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida

arXiv:2407.12813v27.712 citations

Originality Synthesis-oriented

AI Analysis

This work addresses how to optimize synthetic data generation for text classification tasks. It is incremental: it builds on existing methods without introducing new paradigms.

The paper tackles the problem of using Large Language Models to generate synthetic data for text classification. It empirically analyzes factors such as prompt choice and data quality and finds that they significantly affect the performance of models trained on the generated data, though specific numerical results are not provided.

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
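The evaluation idea in the abstract, training a classifier on synthetic data and scoring it on real data, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' setup: a small bag-of-words Naive Bayes classifier stands in for the NLU model, the function names (`train_nb`, `predict_nb`, `accuracy_on_real`) are hypothetical, and the example sentences are made-up toy data rather than anything from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token counts
    label_counts = Counter()            # label -> number of examples
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_nb(model, text):
    """Return the most likely label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy_on_real(synthetic_train, real_test):
    """Train on synthetic examples only, then score on real held-out examples."""
    model = train_nb(synthetic_train)
    correct = sum(predict_nb(model, text) == label for text, label in real_test)
    return correct / len(real_test)


# Toy stand-ins (assumption): text an LLM might generate, plus a real test set.
synthetic_train = [
    ("great movie loved it", "pos"),
    ("wonderful acting and great story", "pos"),
    ("terrible plot hated it", "neg"),
    ("boring and awful film", "neg"),
]
real_test = [
    ("loved the great story", "pos"),
    ("awful boring plot", "neg"),
]

if __name__ == "__main__":
    print(accuracy_on_real(synthetic_train, real_test))
```

Running `accuracy_on_real` once per generation approach against the same real test set gives one comparable score per approach, which is the kind of proxy for synthetic data quality the abstract describes.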
