CL | AI | SD | AS | Nov 4, 2022

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

arXiv:2211.02499v2 | 12 citations | h-index: 57
AI Analysis

This paper addresses real-time multilingual speech translation without extensive labeled data, offering a scalable approach for applications such as live translation, though its contribution is incremental because it builds on existing methods.

The paper tackles building a streaming multilingual speech model (SM2) that transcribes or translates spoken languages into target text using weakly supervised data, achieving comparable or better quality than large non-streaming models and demonstrating truly zero-shot capability for unseen language pairs.

In this paper, we introduce a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is the Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during training.
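
The weak-supervision step described in the abstract (pairing existing speech recognition audio with machine-translated transcripts) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `AsrExample`, `WeakStExample`, `build_weak_st_pairs`, and `toy_translate` are hypothetical, the translator is a stand-in for whatever machine translation service is used, and keeping the original transcript when the source and target languages match is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrExample:
    """One utterance from a speech recognition corpus (hypothetical schema)."""
    audio_path: str
    transcript: str
    source_lang: str


@dataclass(frozen=True)
class WeakStExample:
    """A {source-speech, target-text} pair whose label came from MT, not a human."""
    audio_path: str
    target_text: str
    source_lang: str
    target_lang: str


# Signature assumed for the MT service: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def build_weak_st_pairs(
    asr_corpus: Iterable[AsrExample],
    translate: Translator,
    target_lang: str,
) -> List[WeakStExample]:
    """Turn ASR examples into weakly supervised ST examples."""
    pairs = []
    for ex in asr_corpus:
        if ex.source_lang == target_lang:
            # Assumption: same-language pairs keep the transcript (plain ASR).
            target_text = ex.transcript
        else:
            # The machine translation output serves as the weak label.
            target_text = translate(ex.transcript, ex.source_lang, target_lang)
        pairs.append(
            WeakStExample(ex.audio_path, target_text, ex.source_lang, target_lang)
        )
    return pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Placeholder translator so the sketch runs without a real MT service."""
    return f"[{src}->{tgt}] {text}"


if __name__ == "__main__":
    corpus = [
        AsrExample("utt1.wav", "hallo welt", "de"),
        AsrExample("utt2.wav", "hello world", "en"),
    ]
    for pair in build_weak_st_pairs(corpus, toy_translate, "en"):
        print(pair)
```

Passing the translator in as a callable keeps the sketch independent of any particular translation service; swapping `toy_translate` for a real client would not change the pairing logic.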
