CL · SD · AS · Oct 21, 2022

A Textless Metric for Speech-to-Speech Comparison

arXiv:2210.11835v2 · 5 citations · h-index: 42
Originality: Incremental advance
AI Analysis

This metric offers a way to evaluate speech-to-speech translation for oral languages or languages without reliable ASR systems. The advance is incremental, since it builds on existing speech2unit encoders.

The paper tackles the problem of comparing speech utterances without text transcripts by introducing a textless metric that uses speech2unit encoders and a neural architecture, achieving results that closely correspond to text-based metrics.

In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts. Our speech-to-speech comparison metric uses state-of-the-art speech2unit encoders such as HuBERT to convert speech utterances into discrete acoustic units. We then propose a simple and easily replicable neural architecture that learns a speech-based metric closely corresponding to its text-based counterpart. This textless metric has numerous potential applications, including evaluating speech-to-speech translation for oral languages or for languages without dependable ASR systems, and avoiding the need for ASR transcription altogether. We also show that, for speech-to-speech translation evaluation, ASR-BLEU (which consists of automatically transcribing both the speech hypothesis and the reference, then computing sentence-level BLEU between the transcripts) is a poor proxy for real text-BLEU, even when the ASR system is strong.
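The abstract defines ASR-BLEU as sentence-level BLEU computed between automatic transcripts. The sketch below illustrates only that scoring step in plain Python, so the quantity being compared is concrete. It is not the paper's learned neural metric. The add-one smoothing, the `collapse_repeats` helper, and the integer unit IDs are assumptions made for illustration; a real pipeline would take transcripts from an ASR system, or units from a speech2unit encoder such as HuBERT.

```python
import math
from collections import Counter
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive units (hypothetical preprocessing step)."""
    return [u for u, _ in groupby(units)]


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on every n-gram precision.

    Works on any token sequence: words from a transcript or discrete unit IDs.
    The smoothing choice is an assumption; other variants are also in common use.
    """
    if not hypothesis:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hypothesis, n)
        ref_counts = ngram_counts(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hypothesis) - n + 1, 0)
        log_precision_sum += math.log((matches + 1) / (total + 1))

    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))

    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    # Text case: an ASR transcript of the hypothesis vs. the reference text.
    reference = "the cat sat on the mat".split()
    asr_hypothesis = "the cat sat on a mat".split()
    print(sentence_bleu(asr_hypothesis, reference))

    # Unit case: made-up unit IDs, collapsed before comparison.
    hyp_units = collapse_repeats([5, 5, 9, 9, 9, 2, 7, 7])
    ref_units = collapse_repeats([5, 9, 9, 2, 2, 7])
    print(sentence_bleu(hyp_units, ref_units))
```

Because the function only sees token sequences, the same code scores word transcripts and unit sequences alike. This also shows why ASR-BLEU can drift from text-BLEU: any transcription error changes the tokens being matched.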

Code Implementations: 1 repo
