CLAS
Dec 24, 2024

Zero-resource Speech Translation and Recognition with LLMs

Amazon
arXiv:2412.18566v2
7 citations
h-index: 18
ICASSP
Originality: Incremental advance
AI Analysis

This paper addresses speech processing for languages that lack paired audio-text data. The advance is incremental because it builds on existing LLM and speech-encoder methods.

The paper tackles zero-resource speech translation and recognition by combining a multilingual LLM with a speech encoder and an adaptation module, achieving BLEU scores over 23 for translation and WERs of up to 28.2% for recognition in unseen languages.

Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. Finally, we show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
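
As a rough illustration of the adaptation step described in the abstract, the sketch below maps a sequence of speech-encoder frames into vectors of the LLM's embedding size and joins them with an embedded text prompt. The abstract does not specify the adapter's architecture, so everything here is an assumption: the frame stacking, the single linear layer, the audio-before-prompt ordering, and the names `stack_frames`, `LinearAdapter`, and `build_llm_inputs` are all hypothetical. Plain Python lists stand in for tensors so the example runs without any dependencies.

```python
import random
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames, shortening the sequence by a factor of k.

    Trailing frames that do not fill a complete group of k are dropped.
    (Assumption: the paper's adapter may shorten the sequence differently, or not at all.)
    """
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]


class LinearAdapter:
    """Hypothetical adapter: one linear layer from stacked audio frames to the LLM embedding size."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        scale = in_dim ** -0.5
        # Randomly initialised weights; in a real system these would be trained.
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, frames: List[Vector]) -> List[Vector]:
        # Apply y = W x + b to each frame independently.
        return [
            [sum(w * x for w, x in zip(row, frame)) + b
             for row, b in zip(self.weight, self.bias)]
            for frame in frames
        ]


def build_llm_inputs(
    audio_frames: List[Vector],
    prompt_embeddings: List[Vector],
    adapter: LinearAdapter,
    k: int,
) -> List[Vector]:
    """Project the audio frames and place them before the embedded prompt (ordering is an assumption)."""
    audio_embeddings = adapter(stack_frames(audio_frames, k))
    return audio_embeddings + prompt_embeddings


# Toy dimensions for illustration only; real encoders and LLMs are far larger.
encoder_dim, llm_dim, k = 4, 6, 2
audio = [[float(t + d) for d in range(encoder_dim)] for t in range(5)]  # 5 stand-in encoder frames
prompt = [[0.0] * llm_dim for _ in range(3)]  # 3 stand-in prompt token embeddings
adapter = LinearAdapter(encoder_dim * k, llm_dim)
inputs = build_llm_inputs(audio, prompt, adapter, k)
```

In this toy setup, only the adapter's weights would need training, which is one plausible reading of "lightweight"; whether the encoder or the LLM is also updated is not stated in the abstract.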

