CL · Apr 16, 2024

Incubating Text Classifiers Following User Instruction with Nothing but LLM

arXiv:2404.10877v2 · 25 citations · h-index: 8 · EMNLP
Originality: Highly original
AI Analysis

The paper addresses the need to automatically generate training data for small text classifiers when no annotated data is available, a novel advance over existing methods.

The paper tackles the problem of generating text classification training data from arbitrary class definitions without human annotation, introducing Incubator as the first framework capable of handling complex and mutually dependent classes. Experiments demonstrate it performs well on traditional benchmarks, incorporates label dependencies and user preferences, and enables logical text mining.

In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instruction), so one can train a small text classifier without any human annotation or raw corpus. Compared with pioneering attempts, our proposed Incubator is the first framework that can handle complicated and even mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, Incubator is an LLM first tuned on the instruction-to-data mappings that we obtained from classification datasets and descriptions on HuggingFace, together with in-context augmentation by GPT-4. We then refine Incubator by learning on the cluster centers of semantic textual embeddings to emphasize the uniformity and semantic diversity of its generations. We compare Incubator on various classification tasks with strong baselines such as direct LLM-based inference and training data generation by prompt engineering. Experiments show Incubator is able to (1) perform well on traditional benchmarks, (2) take label dependency and user preference into consideration, and (3) enable logical text mining by incubating multiple classifiers.
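The abstract does not spell out how the cluster centers are obtained or used, so the following is only an illustrative sketch of the general idea: embed candidate texts, cluster the embeddings, and keep the sample nearest each centroid so that the retained examples cover distinct semantic regions. Plain k-means is swapped in here as an assumption; the function name `nearest_to_centers`, the toy 2-D vectors standing in for sentence embeddings, and the example texts are all hypothetical and not taken from the paper.

```python
import math

def nearest_to_centers(embeddings, k, iterations=20):
    """Return, for each of k k-means centroids, the index of the closest sample.

    Assumption: plain k-means with deterministic farthest-point seeding;
    the paper's actual clustering procedure may differ.
    """
    # Seed with the first sample, then repeatedly add the sample that is
    # farthest from every center chosen so far.
    centers = [list(embeddings[0])]
    while len(centers) < k:
        farthest = max(
            embeddings,
            key=lambda e: min(math.dist(e, c) for c in centers),
        )
        centers.append(list(farthest))

    for _ in range(iterations):
        # Assignment step: group each embedding under its nearest center.
        groups = [[] for _ in range(k)]
        for e in embeddings:
            j = min(range(k), key=lambda j: math.dist(e, centers[j]))
            groups[j].append(e)
        # Update step: move each center to the mean of its group
        # (an empty group leaves its center unchanged).
        for j, group in enumerate(groups):
            if group:
                centers[j] = [sum(col) / len(group) for col in zip(*group)]

    # Representative sample per cluster: the one closest to the centroid.
    return [
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))
        for c in centers
    ]

# Hypothetical texts with toy 2-D "embeddings" forming two obvious groups.
toy_texts = [
    "great movie", "loved this film", "a wonderful picture",
    "stock prices fell", "markets dropped today", "shares slid lower",
]
toy_embeddings = [
    (0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.2, 5.0), (5.1, 5.1),
]

selected = nearest_to_centers(toy_embeddings, k=2)
for index in selected:
    print(toy_texts[index])
```

In a real pipeline the toy vectors would be replaced by outputs of a sentence-embedding model, and the selected samples would serve as the diverse targets that the abstract says the model is refined on.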

Code Implementations: 1 repo
