SD | CL | AS | Apr 10, 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

CMU, NVIDIA
arXiv:2304.04596v3 | 225 citations | h-index: 83 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides a multipurpose toolkit for the spoken language translation community, though the contribution is incremental because it revamps an existing toolkit.

The authors address the need for a comprehensive spoken language translation toolkit by developing ESPnet-ST-v2, a publicly available open-source tool that supports offline speech-to-text translation, simultaneous speech-to-text translation, and offline speech-to-speech translation with state-of-the-art architectures.

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open-source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.

Code Implementations: 1 repo