Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
This work addresses real-time speech translation for users who need low-latency communication, but the contribution is incremental, since it builds on existing direct speech-to-speech translation methods.
The paper tackles simultaneous speech-to-speech translation by proposing a direct model that bypasses intermediate text, using discrete units and a variational monotonic multihead attention to improve policy learning. It shows the direct model outperforms cascaded approaches in balancing translation quality and latency on Fisher Spanish-English and MuST-C English-Spanish datasets.
We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of translation is independent from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations learned in an unsupervised manner, instead of continuous spectrogram features, is predicted by the model and passed directly to a vocoder for speech synthesis on the fly. We also introduce the variational monotonic multihead attention (V-MMA) to handle the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies to compare cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency.
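To make the idea of a simultaneous policy operating on source speech features and target discrete units more concrete, the following is a minimal sketch of a read/write decoding loop. It is not the paper's method: a fixed wait-k rule stands in for the learned V-MMA policy, and every name here (`simultaneous_decode`, `wait_k_policy`, `toy_predict_unit`, `EOS_UNIT`) is hypothetical. The toy predictor simply maps each source frame to an integer unit, so only the interleaving of reads and writes is meaningful.

```python
from typing import Callable, List, Sequence, Tuple

READ, WRITE = "READ", "WRITE"
EOS_UNIT = -1  # hypothetical end-of-sequence marker for discrete units


def wait_k_policy(k: int) -> Callable[[int, int], str]:
    """Fixed policy (an assumption, not V-MMA): stay k frames ahead of the output."""
    def policy(n_read: int, n_written: int) -> str:
        return READ if n_read - n_written < k else WRITE
    return policy


def toy_predict_unit(prefix: Sequence[float], units: List[int]) -> int:
    """Stand-in for the translation model: one unit per source frame."""
    if len(units) >= len(prefix):
        return EOS_UNIT
    return int(prefix[len(units)])


def simultaneous_decode(
    source_frames: Sequence[float],
    policy: Callable[[int, int], str],
    predict_unit: Callable[[Sequence[float], List[int]], int],
    max_units: int = 100,
) -> Tuple[List[int], List[int]]:
    """Interleave reading source frames and writing target units.

    Returns the emitted units and, for each unit, how many source
    frames had been read when it was emitted.
    """
    n_read = 0
    units: List[int] = []
    delays: List[int] = []
    while len(units) < max_units:
        source_finished = n_read == len(source_frames)
        if policy(n_read, len(units)) == READ and not source_finished:
            n_read += 1  # consume one more source frame
            continue
        # WRITE (or forced write once the source is exhausted)
        unit = predict_unit(source_frames[:n_read], units)
        if unit == EOS_UNIT:
            break
        units.append(unit)  # in a real system this unit would go to the vocoder
        delays.append(n_read)
    return units, delays


units, delays = simultaneous_decode(
    [3.0, 1.0, 4.0, 1.0, 5.0], wait_k_policy(2), toy_predict_unit
)
print(units)   # [3, 1, 4, 1, 5]
print(delays)  # [2, 3, 4, 5, 5]
```

The `delays` list shows the quality-latency tradeoff in miniature: a larger `k` gives the model more source context before each unit is written, at the cost of emitting every unit later.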