Accelerating Natural Language Understanding in Task-Oriented Dialog
This work addresses resource-efficiency concerns for on-device deployment of natural language understanding models, offering a practical solution for mobile or edge applications, though the contribution is incremental because it builds on existing compression techniques.
The paper tackles the problem of deploying large, resource-intensive task-oriented dialog models on-device by proposing a convolutional model compressed with structured pruning. The model achieves largely comparable results to BERT on ATIS and Snips with under 100K parameters, and on CPUs its multi-task variant predicts intents and slots nearly 63x faster than DistilBERT.
Task-oriented dialog models typically leverage complex neural architectures and large-scale, pre-trained Transformers to achieve state-of-the-art performance on popular natural language understanding benchmarks. However, these models frequently have in excess of tens of millions of parameters, making them impossible to deploy on-device where resource-efficiency is a major concern. In this work, we show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters. Moreover, we perform acceleration experiments on CPUs, where we observe our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.
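To make the idea of structured pruning concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not state which pruning criterion is used, so ranking filters by L1 norm is an assumption here, as are the function names (`filter_l1_norms`, `prune_filters`, `count_params`), the toy weights, and the `keep_ratio` parameter. The point it illustrates is that whole convolutional filters are removed, so the parameter count shrinks without relying on sparse kernels.

```python
# Illustrative sketch of structured (filter-level) pruning for a 1D convolution.
# ASSUMPTIONS: the L1-norm ranking, all names, and the toy weights are
# hypothetical; the source does not specify the pruning criterion.

def filter_l1_norms(weight):
    """weight[f][c][k]: out_channels x in_channels x kernel_size nested lists."""
    return [sum(abs(w) for channel in filt for w in channel) for filt in weight]


def prune_filters(weight, bias, keep_ratio):
    """Keep the highest-norm filters and drop the rest entirely."""
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = filter_l1_norms(weight)
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore original filter order
    return [weight[i] for i in keep], [bias[i] for i in keep], keep


def count_params(weight, bias):
    """Number of scalar weights plus biases in the layer."""
    return sum(len(channel) for filt in weight for channel in filt) + len(bias)


# Toy layer: 4 filters, 2 input channels, kernel size 3.
weight = [
    [[0.9, -0.8, 0.7], [0.6, 0.5, -0.4]],
    [[0.01, 0.0, -0.02], [0.0, 0.01, 0.0]],
    [[-0.3, 0.2, 0.4], [0.1, -0.2, 0.3]],
    [[0.02, 0.01, 0.0], [-0.01, 0.0, 0.03]],
]
bias = [0.1, 0.0, -0.1, 0.0]

pruned_w, pruned_b, kept = prune_filters(weight, bias, keep_ratio=0.5)
print(kept)                                                          # [0, 2]
print(count_params(weight, bias), "->", count_params(pruned_w, pruned_b))  # 28 -> 14
```

In a multi-layer network, removing a filter also removes the corresponding input channel of the next layer, so the overall reduction is larger than this single-layer count suggests.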