CL · LG · Jun 14, 2019

Scalable Syntax-Aware Language Models Using Knowledge Distillation

arXiv:1906.06438v1 · 109 citations
Originality: Incremental advance
AI Analysis

This work addresses the computational scalability of syntactic language models, showing that structural biases remain important even when large amounts of training data are available. Its contribution is incremental: it applies knowledge distillation to this specific bottleneck.

The paper tackles the problem of scaling syntactic language models by using knowledge distillation to transfer structural biases from a syntactic language model trained on a small corpus to an LSTM trained on a larger one. The result is a new state of the art on targeted syntactic evaluations, with substantial improvements over baseline sequential LSTMs.

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers knowledge from a syntactic language model trained on a small corpus to an LSTM language model, hence enabling the LSTM to develop a more structurally sensitive representation of the larger training data it learns from. On targeted syntactic evaluations, we find that, while sequential LSTMs perform much better than previously reported, our proposed technique substantially improves on this baseline, yielding a new state of the art. Our findings and analysis affirm the importance of structural biases, even in models that learn from large amounts of data.
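
As a rough illustration of the distillation idea described above, the sketch below computes a word-level loss in which the student (for example, an LSTM) is trained against a target that interpolates between the observed next word and a teacher's predicted distribution. This is a generic knowledge-distillation formulation, not necessarily the exact objective used in the paper. The function names, the `alpha` weight, and the toy three-word vocabulary are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def distillation_loss(student_logits, teacher_probs, gold_index, alpha=0.5):
    """Cross-entropy of the student against an interpolated target.

    Hypothetical helper: the target for vocabulary item k is
    alpha * one_hot(gold)[k] + (1 - alpha) * teacher_probs[k].
    alpha = 1.0 recovers ordinary language-model training;
    alpha = 0.0 trains only on the teacher's soft predictions.
    """
    student_log_probs = log_softmax(student_logits)
    loss = 0.0
    for k, (log_q, t) in enumerate(zip(student_log_probs, teacher_probs)):
        one_hot = 1.0 if k == gold_index else 0.0
        target = alpha * one_hot + (1.0 - alpha) * t
        loss -= target * log_q
    return loss


# Toy example: a 3-word vocabulary, gold next word at index 1.
example_loss = distillation_loss(
    student_logits=[0.1, 1.2, -0.3],
    teacher_probs=[0.2, 0.7, 0.1],
    gold_index=1,
    alpha=0.5,
)
```

In a setup like the one the abstract describes, the teacher distribution would come from the syntactic model and the student would be the sequential LSTM. How the teacher's predictions are obtained and how the two terms are weighted are details this sketch leaves open.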
