CL AI LG

Oct 13, 2022

CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing

Andy Rosenbaum, Saleh Soltan, Wael Hamza, Amir Saffari, Marco Damonte, Isabel Groves

Amazon

arXiv:2210.07074v224.8304 citations h-index: 29

Originality: Incremental advance

AI Analysis

This work addresses data scarcity in semantic parsing, particularly in multilingual settings, with a method that improves the performance of moderate-sized models. The contribution is incremental, since it builds on existing data augmentation techniques.

The authors tackle the scarcity of labeled data for multilingual semantic parsing by generating synthetic data from a large language model and using it to augment the training set of a smaller model. They report significant improvements over baselines in two low-resource settings: English PIZZA and mTOP cross-lingual zero-shot.

A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples; however, LLMs are unsuitable for runtime systems, which require low latency. In this work, we propose CLASP, a simple method to improve low-resource SP for moderate-sized models: we generate synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters). We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English, and the model must generalize to four new languages. On both datasets, we show significant improvements over strong baseline methods.
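The general idea of few-shot data augmentation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt layout, the helper names (`build_prompt`, `parse_completion`, `augment`), the `generate` callable standing in for a large language model, and the toy parse strings are all assumptions made for illustration, and the parse strings do not follow the actual PIZZA or mTOP annotation formats.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (utterance, parse); hypothetical representation


def build_prompt(seed: List[Example]) -> str:
    """Format the real examples as a few-shot prompt ending in an open slot."""
    blocks = [f"Utterance: {u}\nParse: {p}" for u, p in seed]
    blocks.append("Utterance:")  # the model is expected to continue from here
    return "\n\n".join(blocks)


def parse_completion(completion: str) -> Optional[Example]:
    """Extract one (utterance, parse) pair, or None if the text is malformed."""
    utterance, sep, rest = completion.partition("\nParse:")
    if not sep:
        return None
    utterance = utterance.strip()
    lines = rest.strip().splitlines()
    parse = lines[0].strip() if lines else ""
    if not utterance or not parse:
        return None
    return (utterance, parse)


def augment(
    real: List[Example],
    generate: Callable[[str], str],  # stand-in for a call to a large model
    n_samples: int,
) -> List[Example]:
    """Return the real examples plus de-duplicated, well-formed synthetic ones."""
    prompt = build_prompt(real)
    seen = set(real)
    synthetic: List[Example] = []
    for _ in range(n_samples):
        example = parse_completion(generate(prompt))
        if example is not None and example not in seen:
            seen.add(example)
            synthetic.append(example)
    return real + synthetic


if __name__ == "__main__":
    # Toy data and a canned "model"; both are made up for this sketch.
    real_examples = [("one pepperoni pizza", "(ORDER (TOPPING pepperoni))")]

    def fake_llm(prompt: str) -> str:
        return " two ham pizzas\nParse: (ORDER (NUMBER 2) (TOPPING ham))"

    for utterance, parse in augment(real_examples, fake_llm, n_samples=3):
        print(utterance, "->", parse)
```

The combined list would then serve as the training set for the smaller model. Discarding malformed and duplicate generations is one simple filtering choice; the filtering used in the paper may differ.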

