MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
The work addresses the need for more natural and personalized speech translation systems, though the contribution appears incremental because it builds on existing speech language model approaches.
The paper tackles multilingual speech-to-speech translation without using text data, while preserving speaker style in the translated speech.
There has been growing research interest and progress in speech-to-speech translation (S2ST), the task of translating utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST while preserving speaker style.