CL · AI · Dec 13, 2020

C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling

arXiv:2012.07004v1 · 21 citations
AI Analysis

This work addresses the shortage of training data, in both quantity and diversity, for slot filling, a fundamental module of spoken language understanding. That shortage is a common bottleneck for NLP practitioners.

The paper introduces C2C-GenDA, a data augmentation framework for slot filling that reconstructs existing utterances into alternative expressions while preserving semantics. It improves slot filling F-scores by 7.99 (11.9%) on ATIS and 5.76 (13.6%) on Snips datasets when training data is limited to hundreds of utterances.
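The reported gains pair an absolute F-score improvement with a percentage. As a rough sanity check, the sketch below backs out the baseline F-score each pair would imply. It assumes the percentage is the gain relative to the un-augmented baseline, which this summary does not state explicitly; the function name `implied_baseline` is hypothetical.

```python
def implied_baseline(abs_gain: float, rel_gain_pct: float) -> float:
    """Return the baseline score implied by an absolute gain and a
    relative gain in percent, assuming rel_gain = abs_gain / baseline."""
    return abs_gain / (rel_gain_pct / 100.0)


# Gains quoted in the summary above: (absolute F-score gain, relative gain in %).
reported = {"ATIS": (7.99, 11.9), "Snips": (5.76, 13.6)}

for dataset, (abs_gain, rel_pct) in reported.items():
    base = implied_baseline(abs_gain, rel_pct)
    # The augmented score is simply baseline plus the absolute gain.
    print(f"{dataset}: implied baseline {base:.1f}, augmented {base + abs_gain:.1f}")
```

Under that assumption the implied baselines come out at roughly 67.1 F on ATIS and 42.4 F on Snips. These are derived estimates, not figures reported here.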

Slot filling, a fundamental module of spoken language understanding, often suffers from insufficient quantity and diversity of training data. To remedy this, we propose a novel Cluster-to-Cluster generation framework for Data Augmentation (DA), named C2C-GenDA. It enlarges the training set by reconstructing existing utterances into alternative expressions while preserving their semantics. Unlike previous DA works that reconstruct utterances one by one independently, C2C-GenDA jointly encodes multiple existing utterances of the same semantics and simultaneously decodes multiple unseen expressions. Jointly generating multiple new utterances makes it possible to model the relations between generated instances and encourages diversity. In addition, encoding multiple existing utterances gives C2C a wider view of existing expressions, which helps reduce generations that duplicate existing data. Experiments on the ATIS and Snips datasets show that instances augmented by C2C-GenDA improve slot filling by 7.99 (11.9%) and 5.76 (13.6%) F-score respectively when there are only hundreds of training utterances.
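To make the "cluster" idea concrete, here is a minimal sketch of how utterances with the same semantics could be grouped before cluster-to-cluster generation. It is an illustration only, not the authors' implementation. It assumes utterances are delexicalized with `<slot_name>` placeholders and that "same semantics" means "same set of slots". The helper names `group_by_slots` and `split_cluster` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed placeholder format for delexicalized slots, e.g. "<to_city>".
SLOT_PATTERN = re.compile(r"<(\w+)>")


def group_by_slots(utterances):
    """Group utterances whose slot sets match (assumed proxy for same semantics)."""
    clusters = defaultdict(list)
    for text in utterances:
        key = frozenset(SLOT_PATTERN.findall(text))
        clusters[key].append(text)
    return dict(clusters)


def split_cluster(cluster, n_input):
    """Split one cluster into an input cluster and a target cluster,
    the shape of a cluster-to-cluster training pair."""
    return cluster[:n_input], cluster[n_input:]


utterances = [
    "show flights from <from_city> to <to_city>",
    "i need to fly to <to_city> from <from_city>",
    "what flies between <from_city> and <to_city>",
    "what is the weather in <city>",
]

clusters = group_by_slots(utterances)
flight_cluster = clusters[frozenset({"from_city", "to_city"})]
inputs, targets = split_cluster(flight_cluster, n_input=2)
```

A sequence-to-sequence model would then encode all of `inputs` jointly and decode all of `targets` jointly, which is where this approach differs from rewriting one utterance at a time.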

Code Implementations: 1 repo