SD CL AS Apr 10, 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni

CMU, NVIDIA

arXiv:2304.04596v344.3225 citations h-index: 83 Has Code

Originality: Synthesis-oriented

AI Analysis

This paper provides a multipurpose toolkit for the spoken language translation community, though the contribution is incremental: it is a revamp of the existing ESPnet-ST toolkit.

The authors tackled the need for a comprehensive toolkit for spoken language translation by developing ESPnet-ST-v2, which supports offline speech-to-text translation, simultaneous speech-to-text translation, and offline speech-to-speech translation with state-of-the-art architectures, resulting in a publicly available open-source tool.

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.
