Iterative Paraphrastic Augmentation with Discriminative Span Alignment
This work addresses the challenge of scaling up language understanding datasets for researchers and practitioners. It is incremental in that it builds on an existing resource, FrameNet.
The paper expands linguistic annotation resources through a paraphrastic augmentation strategy that automatically generates 495,300 unique (Frame, Trigger) combinations, a roughly 50x expansion of FrameNet v1.7, from roughly four days of collecting training data for the alignment model and approximately one day of parallel compute.
We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.
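To make the combination of constrained paraphrasing and span alignment concrete, the following is a minimal sketch of one plausible augmentation loop, not the authors' implementation. It assumes that the lexical constraint forbids triggers already collected, and that each new paraphrase is fed back in as the next input; neither detail is stated in the abstract. The names `augment`, `toy_paraphrase`, `toy_align`, and `RING` are hypothetical, and the two toy functions merely stand in for a neural paraphraser and a learned discriminative aligner.

```python
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int]  # token offsets, end-exclusive

# Hypothetical toy lexicon; a real system would use a paraphrase model.
RING = ["bought", "purchased", "acquired"]


def toy_paraphrase(tokens: List[str], banned: Set[str]) -> Optional[List[str]]:
    """Stand-in paraphraser: swap the first RING word for one not in `banned`."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in RING:
            options = [w for w in RING if w not in banned]
            if not options:
                return None  # every known alternative is already banned
            out[i] = options[0]
            return out
    return None


def toy_align(src: List[str], src_span: Span, tgt: List[str]) -> Optional[Span]:
    """Stand-in aligner: return the first position where the sentences differ.

    A learned aligner would score candidate target spans against `src_span`;
    this toy version ignores `src_span` entirely.
    """
    if len(src) != len(tgt):
        return None
    for i, (a, b) in enumerate(zip(src, tgt)):
        if a != b:
            return (i, i + 1)
    return None


def augment(
    frame: str,
    tokens: List[str],
    trigger_span: Span,
    paraphrase: Callable[[List[str], Set[str]], Optional[List[str]]],
    align: Callable[[List[str], Span, List[str]], Optional[Span]],
    rounds: int = 3,
) -> List[Tuple[str, str, List[str]]]:
    """Collect (frame, trigger, sentence) triples by iterated paraphrasing."""
    seen = {" ".join(tokens[trigger_span[0]:trigger_span[1]])}
    results = []
    cur_tokens, cur_span = tokens, trigger_span
    for _ in range(rounds):
        new_tokens = paraphrase(cur_tokens, seen)  # assumed: ban seen triggers
        if new_tokens is None:
            break
        new_span = align(cur_tokens, cur_span, new_tokens)
        if new_span is None:
            break
        trigger = " ".join(new_tokens[new_span[0]:new_span[1]])
        if trigger in seen:
            break  # nothing new was produced
        seen.add(trigger)
        results.append((frame, trigger, new_tokens))
        cur_tokens, cur_span = new_tokens, new_span  # iterate on the output
    return results


out = augment("Commerce_buy", ["She", "bought", "a", "car"], (1, 2),
              toy_paraphrase, toy_align, rounds=5)
for frame, trigger, sentence in out:
    print(frame, trigger, " ".join(sentence))
```

In this toy run the seed trigger "bought" yields two further triggers for the same frame before the lexicon is exhausted. The point of the sketch is only the division of labor: the paraphraser proposes a new sentence, and the aligner projects the annotated span onto it so that the frame label carries over.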