Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
This work addresses the challenge of aligning parallel speech data for translation tasks, offering researchers and practitioners in speech processing a more robust alignment method, though the contribution is incremental because it builds on existing speech mining techniques.
The authors tackle the problem of aligning parallel speech documents without text transcriptions by proposing Speech Vecalign, which produces longer alignments than the Global Mining baseline and less noisy alignments than the Local Mining baseline. Applied to 3,000 hours of unlabeled English-German speech from VoxPopuli, it yields about 1,000 hours of high-quality alignments and improves speech-to-speech translation over Global Mining by up to 0.37 ASR-BLEU.
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
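To illustrate what monotonically aligning segment embeddings can look like, the following is a minimal sketch and not the Speech Vecalign algorithm itself. It assumes each document is already represented as a matrix of segment embeddings (one row per speech segment) and uses a simple dynamic program that allows one-to-one matches and skips only. The function names `cosine_sim` and `monotonic_align` and the `skip_cost` parameter are hypothetical choices made for this example.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def monotonic_align(src, tgt, skip_cost=0.5):
    """Return a monotonic list of (src_index, tgt_index) pairs.

    Simplified illustration: one-to-one matches cost (1 - cosine similarity),
    and leaving a segment unaligned costs skip_cost (a hypothetical parameter).
    """
    sim = cosine_sim(src, tgt)
    n, m = sim.shape
    cost = np.full((n + 1, m + 1), np.inf)
    back = np.zeros((n + 1, m + 1), dtype=int)  # 1=match, 2=skip src, 3=skip tgt
    cost[0, 0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                c = cost[i - 1, j - 1] + (1.0 - sim[i - 1, j - 1])
                if c < cost[i, j]:
                    cost[i, j], back[i, j] = c, 1
            if i > 0:
                c = cost[i - 1, j] + skip_cost
                if c < cost[i, j]:
                    cost[i, j], back[i, j] = c, 2
            if j > 0:
                c = cost[i, j - 1] + skip_cost
                if c < cost[i, j]:
                    cost[i, j], back[i, j] = c, 3

    # Trace back from the end of both documents to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i, j]
        if move == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == 2:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    # Toy embeddings: the source has three segments, the target lacks the middle one.
    src = np.eye(3)
    tgt = np.eye(3)[[0, 2]]
    print(monotonic_align(src, tgt))
```

Because every step of the dynamic program moves forward in at least one document, the returned pairs can never cross, which is the monotonicity constraint that distinguishes document alignment from mining segment pairs independently of their order.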