CL AI LG MM AS | May 1, 2021

AlloST: Low-resource Speech Translation without Source Transcription

arXiv:2105.00171v3
10 citations
Originality: Incremental advance
AI Analysis

This paper addresses speech translation in low-resource settings where source transcriptions are unavailable, offering a practical approach for multilingual applications.

The paper tackles low-resource speech translation without source transcription by proposing a framework built on a universal phone recognizer and byte pair encoding. On the Fisher Spanish-English and Taigi-Mandarin datasets, it achieves performance close to that of the existing best method that uses source transcription.

The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates the phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence. Due to the conversion of symbols, a segmented sequence represents not only pronunciation but also language-dependent information lacking in phones. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline, and the performance is close to that of the existing best method using source transcription.
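
The abstract's point about byte pair encoding over phones can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`learn_bpe`, `apply_merge`, `encode`), the `+` joiner, and the tiny phone corpus are all hypothetical, chosen only to show how repeatedly merging the most frequent adjacent phone pair turns a phone sequence into shorter, syllable-like units.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            # "+" is an arbitrary joiner used here to keep merged units readable.
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of phone sequences."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            # Stop once no adjacent pair repeats in the corpus.
            break
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges


def encode(seq, merges):
    """Segment a phone sequence by applying the learned merges in order."""
    out = list(seq)
    for pair in merges:
        out = apply_merge(out, pair)
    return out


# Hypothetical toy corpus of phone sequences (not taken from the paper's data).
corpus = [
    ["o", "l", "a"],
    ["o", "l", "a"],
    ["k", "a", "s", "a"],
    ["l", "a"],
]
merges = learn_bpe(corpus, num_merges=10)
segmented = encode(["k", "o", "l", "a"], merges)
print(merges)
print(segmented)
```

Because the merge rules are learned from the phone statistics of one language's data, the resulting units depend on that language, which is one way to read the abstract's remark that a segmented sequence carries language-dependent information that raw phones lack.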

Code Implementations: 1 repo
