CLAS
Nov 1, 2023

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Amazon, Apple, MIT
arXiv:2311.00697v1132 citations
h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses conversational speech translation in real-life settings such as meeting transcription. The contribution is incremental, since it builds on existing methods.

The paper tackles the problem of single-channel multi-speaker conversational speech translation, which conventional systems struggle with, by proposing an end-to-end multi-task model that outperforms reference systems in multi-speaker conditions while maintaining comparable performance in single-speaker conditions.

Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions, and against conventional ST systems, show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.
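
The two data-preparation ideas in the abstract, merging two single-speaker channels into one and writing speaker turns as special tokens in a serialized label, can be illustrated with a minimal sketch. This is not the authors' released code: the token strings `[TURN]` and `[XT]`, the `Segment` type, the helper names, and the choice to average the channels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical special tokens; the source only says "special tokens" are used.
TURN = "[TURN]"  # speaker change without overlap
XT = "[XT]"      # speaker change with overlapping speech (cross-talk)


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def mix_channels(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Merge two mono channels into one by averaging samples.

    Assumes both channels share a sample rate; the shorter one is zero-padded.
    """
    n = max(len(left), len(right))
    a = list(left) + [0.0] * (n - len(left))
    b = list(right) + [0.0] * (n - len(right))
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def serialize(segments: Sequence[Segment]) -> str:
    """Order segments by start time and mark speaker changes with tokens."""
    ordered = sorted(segments, key=lambda s: s.start)
    tokens: List[str] = []
    prev_speaker = None
    prev_end = 0.0
    for seg in ordered:
        if prev_speaker is not None and seg.speaker != prev_speaker:
            # Overlap with earlier speech counts as cross-talk in this sketch.
            tokens.append(XT if seg.start < prev_end else TURN)
        tokens.append(seg.text)
        prev_speaker = seg.speaker
        prev_end = max(prev_end, seg.end)
    return " ".join(tokens)


example = [
    Segment("A", 0.0, 2.0, "hola"),
    Segment("B", 1.5, 3.0, "sí"),
    Segment("A", 3.5, 4.0, "gracias"),
]
serialized = serialize(example)
mixed = mix_channels([1.0, 0.0], [0.0, 1.0, 0.5])
```

In this sketch the same serialization could be applied to either the transcript or the translation, so one target format would serve both the recognition and translation tasks; whether the paper does exactly this is not stated in the abstract.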

Code Implementations: 1 repo
