CL | Oct 5, 2020

SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

arXiv:2010.02295v3 | 752 citations
Originality: Incremental advance
AI Analysis

SPLAT addresses the disparity between the speech and text modalities in SLU, which matters for applications such as voice assistants. The advance is incremental, since it builds on existing pre-training methods.

The paper tackles spoken language understanding by proposing SPLAT, a semi-supervised framework that jointly pre-trains speech and language modules and improves on the previous state of the art on the Spoken SQuAD dataset by more than 10%.

Spoken language understanding (SLU) requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
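
The joint objective described in the abstract (a masked language modeling loss for each module on unpaired data, plus an alignment term on paired data) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the use of one pooled vector per utterance or transcript, the L1 distance, and the `align_weight` parameter are all assumptions that the text above does not specify.

```python
def sequence_alignment_loss(speech_vecs, text_vecs):
    """Mean L1 distance between paired sequence-level vectors.

    speech_vecs, text_vecs: equal-length lists of equal-dimension vectors,
    one per paired utterance/transcript. These stand in for hypothetical
    pooled outputs of the speech and language modules; the L1 distance is
    an assumed choice of alignment measure.
    """
    if not speech_vecs or len(speech_vecs) != len(text_vecs):
        raise ValueError("need a non-empty batch of paired vectors")
    total = 0.0
    for s, t in zip(speech_vecs, text_vecs):
        if len(s) != len(t):
            raise ValueError("paired vectors must share a dimension")
        total += sum(abs(a - b) for a, b in zip(s, t))
    return total / len(speech_vecs)


def joint_loss(speech_mlm_loss, text_mlm_loss, align_loss, align_weight=1.0):
    """Combine the two self-supervised losses with the alignment term.

    align_weight is a hypothetical weighting knob, not taken from the source.
    """
    return speech_mlm_loss + text_mlm_loss + align_weight * align_loss


if __name__ == "__main__":
    # Toy paired batch: two utterances with 2-dimensional representations.
    speech = [[0.0, 0.0], [1.0, 1.0]]
    text = [[1.0, -2.0], [1.0, 1.0]]
    align = sequence_alignment_loss(speech, text)
    print(joint_loss(0.8, 0.6, align, align_weight=0.5))
```

In this sketch the masked language modeling losses can be computed on unpaired speech and text, while the alignment term needs only the small paired subset, which mirrors the semi-supervised setup the abstract describes.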

Code Implementations: 2 repos