Few-shot Mining of Naturally Occurring Inputs and Outputs
This work addresses the problem of expensive data labeling for NLP practitioners by offering a method to augment training data with high-quality, naturally occurring examples. The contribution is incremental, since it builds on existing mining and ranking techniques.
The paper tackles the high cost of creating labeled natural language training data by mining naturally occurring input-output pairs from large corpora, using a two-stage method and a small seed set. Augmenting the seed set with the mined data yields improvements of 13 F1 on SQuAD-style reading comprehension and 1.46 ROUGE-L on XSum abstractive summarization.
Creating labeled natural language training data is expensive and requires significant human effort. We mine input-output examples from large corpora using a supervised mining function trained on a small seed set of only 100 examples. The mining consists of two stages: (1) a biencoder-based, recall-oriented dense search that pairs inputs with potential outputs, and (2) a crossencoder-based filter that re-ranks the output of the biencoder stage for better precision. Unlike model-generated data augmentation, our method mines naturally occurring, high-quality input-output pairs that mimic the style of the seed set, and it applies to multiple tasks. On SQuAD-style reading comprehension, augmenting the seed set with the mined data results in an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set. Likewise, we see an improvement of 1.46 ROUGE-L on XSum abstractive summarization.
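The two-stage structure described above can be illustrated with a minimal sketch. This is not the paper's implementation: the trained biencoder is replaced here by a bag-of-words cosine similarity, the trained crossencoder by a Jaccard token overlap, and neither stand-in is fitted to a seed set. All function names (`embed`, `recall_stage`, `joint_score`, `filter_stage`, `mine`) and the parameters `k` and `threshold` are hypothetical, chosen only to show how a recall-oriented search feeds a precision-oriented filter.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for one biencoder tower (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_stage(inputs, outputs, k):
    """Stage 1: encode inputs and outputs independently, keep the top-k outputs per input."""
    out_vecs = [embed(o) for o in outputs]
    candidates = []
    for inp in inputs:
        q = embed(inp)
        ranked = sorted(range(len(outputs)),
                        key=lambda j: cosine(q, out_vecs[j]),
                        reverse=True)
        candidates.extend((inp, outputs[j]) for j in ranked[:k])
    return candidates


def joint_score(inp, out):
    """Stand-in for a crossencoder (assumption): scores the pair jointly via Jaccard overlap."""
    a, b = set(inp.lower().split()), set(out.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_stage(candidates, threshold):
    """Stage 2: re-rank candidate pairs by the joint score and drop low-scoring ones."""
    scored = sorted(((joint_score(i, o), i, o) for i, o in candidates),
                    key=lambda t: t[0],
                    reverse=True)
    return [(i, o) for s, i, o in scored if s >= threshold]


def mine(inputs, outputs, k=2, threshold=0.3):
    """Run recall, then filter; returns the mined (input, output) pairs."""
    return filter_stage(recall_stage(inputs, outputs, k), threshold)
```

The split reflects the usual trade-off between the two model types: independent encoding lets the first stage score every input against every output cheaply, while the more expensive joint scoring is applied only to the short candidate list that survives it.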