CL · SD · AS · Oct 23, 2023

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

arXiv:2310.14806v1 · 4 citations · h-index: 34
Originality: Highly original
AI Analysis

The paper addresses the need for efficient real-time spoken language processing in global communication applications, and it presents a novel method rather than an incremental improvement.

The paper tackles the inefficiency of running separate systems for automatic speech recognition (ASR) and speech translation (ST). It proposes a streaming Transformer-Transducer model that jointly produces transcription and translation with a single decoder, and it reports experiments on three language pairs ({it,es,de}->en) in support of the approach.

The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficiencies in computational resources, and increased synchronization complexity in real time. In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. We introduce a novel method for joint token-level serialized output training based on timestamp information to effectively produce ASR and ST outputs in the streaming setting. Experiments on {it,es,de}->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
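The idea of serializing ASR and ST tokens into one decoder target using timestamps can be illustrated with a small sketch. This is not the authors' implementation: the `TimedToken` type, the `<asr>` and `<st>` switch tags, the tie-breaking rule (ASR first), and the assumption that every token already carries a monotonic emission time are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedToken:
    text: str
    time: float  # emission time in seconds (assumed to be given, e.g. by an aligner)


# Hypothetical tags marking a switch between the two output streams.
ASR_TAG = "<asr>"
ST_TAG = "<st>"


def serialize_by_timestamp(asr: List[TimedToken], st: List[TimedToken]) -> List[str]:
    """Interleave ASR and ST tokens into a single sequence ordered by time.

    A tag is emitted only when the output switches from one stream to the other.
    Ties are broken in favour of ASR (an assumption of this sketch), and the
    original position is used as a final key so each stream keeps its own order
    among equal timestamps.
    """
    tagged = [(tok.time, 0, i, ASR_TAG, tok.text) for i, tok in enumerate(asr)]
    tagged += [(tok.time, 1, i, ST_TAG, tok.text) for i, tok in enumerate(st)]
    tagged.sort(key=lambda item: item[:3])

    serialized: List[str] = []
    current_tag = None
    for _, _, _, tag, text in tagged:
        if tag != current_tag:
            serialized.append(tag)
            current_tag = tag
        serialized.append(text)
    return serialized


# Toy it->en example with made-up timestamps.
asr_tokens = [TimedToken("ciao", 0.2), TimedToken("mondo", 0.6)]
st_tokens = [TimedToken("hello", 0.4), TimedToken("world", 0.9)]
target = serialize_by_timestamp(asr_tokens, st_tokens)
print(target)
```

With these toy timestamps the two streams alternate word by word. If all translation timestamps fell after the last transcription timestamp, the same function would produce the transcript followed by the translation, which shows how the timestamps alone control the degree of interleaving.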

