CL | SD | AS | Feb 29, 2024

Compact Speech Translation Models via Discrete Speech Units Pretraining

arXiv:2402.19333v2 | 27 citations | h-index: 45 | IWSLT
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient speech translation in low-resource and on-device settings, though it is incremental as it builds on existing self-supervised speech models.

The paper tackles the problem of creating compact speech-to-text translation models for memory-constrained scenarios like on-device deployment, achieving consistent improvements over baselines on CoVoST-2 (X-En) with a model half the size of the Self-Supervised Speech model.

We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create more compact Speech-to-text Translation models. In contrast to using the SSS model for initialization, our method is better suited to memory-constrained scenarios such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data, respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization at inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method consistently improves over the baseline on three metrics while being compact, i.e., only half the SSS model size.
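
The initialisation step described in the abstract (encoder taken from the Fbk-to-DSU model, decoder taken from the DSU-to-Trl model) can be illustrated with a minimal Python sketch. This is not the authors' code: the `EncoderDecoder` class, the `init_compact_model` helper, and the parameter names are all hypothetical stand-ins, and parameters are plain lists rather than real network weights.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class EncoderDecoder:
    """Toy stand-in for an encoder-decoder model (hypothetical, not a real API).

    Parameters are stored by name in two dictionaries so that the encoder
    and the decoder can be copied independently.
    """
    encoder: dict = field(default_factory=dict)
    decoder: dict = field(default_factory=dict)


def init_compact_model(fbk_to_dsu: EncoderDecoder,
                       dsu_to_trl: EncoderDecoder) -> EncoderDecoder:
    """Build the compact model from the two pretrained models.

    The encoder comes from the Fbk-to-DSU model (it reads filterbank input);
    the decoder comes from the DSU-to-Trl model (it emits translations).
    Deep copies keep later finetuning from altering the pretrained models.
    """
    return EncoderDecoder(
        encoder=copy.deepcopy(fbk_to_dsu.encoder),
        decoder=copy.deepcopy(dsu_to_trl.decoder),
    )


# Illustrative "pretrained" models with made-up parameter values.
fbk_to_dsu = EncoderDecoder(
    encoder={"enc.layer0.weight": [0.1, 0.2]},
    decoder={"dec.layer0.weight": [0.3, 0.4]},  # DSU decoder, discarded
)
dsu_to_trl = EncoderDecoder(
    encoder={"enc.layer0.weight": [0.5, 0.6]},  # DSU encoder, discarded
    decoder={"dec.layer0.weight": [0.7, 0.8]},
)

compact = init_compact_model(fbk_to_dsu, dsu_to_trl)
# `compact` would next be finetuned on paired Fbk-Trl data (not shown).
```

Under these assumptions, the DSU decoder of the first model and the DSU encoder of the second are dropped, which is consistent with the abstract's statement that no speech discretization is needed at inference.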

