CL · Feb 2, 2021

CTC-based Compression for Direct Speech Translation

arXiv:2102.01578v1812 citations
Originality: Incremental advance
AI Analysis

This work addresses how to handle input audio efficiently in direct speech translation models, a problem relevant to researchers and practitioners building end-to-end speech translation systems.

This paper proposes a method for dynamic compression of input in direct speech translation (ST) models, utilizing Connectionist Temporal Classification (CTC) to compress input sequences based on phonetic characteristics. The method achieves a 1.3-1.5 BLEU improvement over a strong baseline on English-Italian and English-German language pairs, while reducing memory footprint by over 10%.

Previous studies demonstrated that a dynamic phone-informed compression of the input audio is beneficial for speech translation (ST). However, they required a dedicated model for phone recognition and did not test this solution for direct ST, in which a single model translates the input audio into the target language without intermediate representations. In this work, we propose the first method able to perform a dynamic compression of the input in direct ST models. In particular, we exploit the Connectionist Temporal Classification (CTC) to compress the input sequence according to its phonetic characteristics. Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement over a strong baseline on two language pairs (English-Italian and English-German), while also reducing the memory footprint by more than 10%.
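To make the idea of compressing a sequence "according to its phonetic characteristics" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each frame is assigned the label with the highest CTC score and that consecutive frames sharing a label are merged by averaging. The abstract does not state the merging rule, so the averaging, the function name `ctc_compress`, and the array shapes are all illustrative assumptions.

```python
import numpy as np

def ctc_compress(frames, ctc_logits):
    """Merge runs of consecutive frames that share the same CTC argmax label.

    Assumptions (not taken from the abstract): merging is a plain average,
    and the CTC blank symbol is treated like any other label.

    frames:     (T, D) array of per-frame encoder states
    ctc_logits: (T, V) array of per-frame CTC scores over V labels
    Returns (compressed, labels): a (T', D) array and a (T',) array, T' <= T.
    """
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(ctc_logits).argmax(axis=-1)  # one label per frame
    if labels.size == 0:
        return frames, labels
    # A new run starts at frame 0 and wherever the label changes.
    starts = np.flatnonzero(np.r_[True, labels[1:] != labels[:-1]])
    # Sum the frames of each run, then divide by the run length.
    sums = np.add.reduceat(frames, starts, axis=0)
    lengths = np.diff(np.r_[starts, labels.size])
    return sums / lengths[:, None], labels[starts]

# Usage: six frames whose predicted labels are 0, 0, 1, 1, 1, 0.
frames = np.arange(12, dtype=float).reshape(6, 2)
logits = np.eye(2)[[0, 0, 1, 1, 1, 0]]
compressed, run_labels = ctc_compress(frames, logits)
print(compressed.shape)  # (3, 2): six frames collapse into three
```

Because the number of runs depends on the predicted labels, the output length varies with the content of each utterance, which is what makes the compression dynamic rather than a fixed-rate subsampling.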

Code Implementations: 1 repo