CL · IR · LG · May 18, 2023

ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

arXiv:2305.10703v1238 citations
Originality: Incremental advance
AI Analysis

The paper addresses efficient and effective zero-shot learning for NLP practitioners, offering a faster alternative to existing training data generation methods.

The paper tackles zero-shot text classification by building training data from a general-domain unlabeled corpus with a retrieval-enhanced framework. It reports a 4.3% gain over the strongest baselines on nine datasets and saves around 70% of the time compared to baselines that use large natural language generation models.

With the development of large language models (LLMs), zero-shot learning has attracted much attention for various NLP tasks. Different from prior works that generate training data with billion-scale natural language generation (NLG) models, we propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus. To realize this, we first conduct contrastive pretraining to learn an unsupervised dense retriever for extracting the most relevant documents using class-descriptive verbalizers. We then further propose two simple strategies, namely Verbalizer Augmentation with Demonstrations and Self-consistency Guided Filtering to improve the topic coverage of the dataset while removing noisy examples. Experiments on nine datasets demonstrate that REGEN achieves 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models. Besides, REGEN can be naturally integrated with recently proposed large language models to boost performance.
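The retrieve-then-filter idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: `embed` is a toy bag-of-words encoder standing in for the contrastively pretrained dense retriever, the corpus and verbalizers are invented, and `filter_consistent` is a crude agreement check that only gestures at Self-consistency Guided Filtering. Verbalizer Augmentation with Demonstrations is not modeled.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words weights."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {tok: v / norm for tok, v in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse, already normalised vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(verbalizers, corpus, top_k=2):
    """For each class, pseudo-label the top_k documents closest to its verbalizer."""
    doc_vecs = [embed(doc) for doc in corpus]
    pseudo_labeled = []
    for label, query in verbalizers.items():
        q = embed(query)
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        pseudo_labeled.extend((corpus[i], label) for i in ranked[:top_k])
    return pseudo_labeled


def filter_consistent(pseudo_labeled, verbalizers):
    """Keep an example only if its nearest verbalizer matches its pseudo-label.

    Hypothetical simplification: a single agreement check, not the paper's filter.
    """
    queries = {label: embed(q) for label, q in verbalizers.items()}
    kept = []
    for doc, label in pseudo_labeled:
        d = embed(doc)
        predicted = max(queries, key=lambda c: cosine(queries[c], d))
        if predicted == label:
            kept.append((doc, label))
    return kept


# Invented toy data: class-descriptive verbalizers and an unlabeled corpus.
verbalizers = {
    "sports": "football team match",
    "business": "market company stocks",
}
corpus = [
    "the team won the football match",
    "stocks fell as the market closed",
    "the coach praised the football team",
    "the company reported market earnings",
]

noisy = retrieve(verbalizers, corpus, top_k=3)
clean = filter_consistent(noisy, verbalizers)
```

With `top_k=3` each class pulls in one off-topic document, and the agreement check drops it. In a real pipeline the surviving pairs would be used to fine-tune a classifier.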

Code Implementations: 1 repo
