CL, AI, LG | Nov 27, 2024

Aligning Pre-trained Models for Spoken Language Translation

arXiv:2411.18294v1 | 3 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses efficient end-to-end speech translation, but the advance is incremental because it builds on existing pre-trained models.

The paper tackles end-to-end speech translation by aligning frozen pre-trained ASR and MT models with a small connector module, achieving improved translation results as model sizes increase, with connectors also serving as domain adapters to boost performance.

This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (a Q-Former or our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder, and it is the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset, as we investigate the alignment approach in a small-scale scenario focusing on ST. While the size of the connector module is kept constant and small in comparison (< 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this is a viable and scalable way to train end-to-end ST systems.
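The training setup described in the abstract (frozen ASR and MT models, with a small connector as the only trained component) can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation. The `Connector` class, the `freeze` helper, all dimensions, the tiny stand-in modules used in place of real pre-trained ASR and MT models, and the placeholder loss are assumptions made for illustration. The connector here is only a loose reading of a "subsampler plus Transformer encoder" design, and the Q-Former variant is not shown.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

ASR_DIM, MT_DIM = 64, 32  # hypothetical embedding sizes, not taken from the paper


class Connector(nn.Module):
    """Hypothetical connector: strided-conv subsampling, then a small Transformer encoder."""

    def __init__(self, asr_dim: int, mt_dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # Strided 1-D convolution shortens the speech sequence by a factor of 4
        # and projects it to the MT embedding width.
        self.subsample = nn.Conv1d(asr_dim, mt_dim, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(
            d_model=mt_dim, nhead=num_heads, dim_feedforward=2 * mt_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, asr_states: torch.Tensor) -> torch.Tensor:
        # asr_states: (batch, frames, asr_dim); Conv1d expects (batch, channels, frames)
        x = self.subsample(asr_states.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients for every parameter and switch the module to eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


# Tiny stand-ins for the pre-trained models (real systems would load checkpoints here).
asr_encoder = freeze(nn.Sequential(nn.Linear(80, ASR_DIM), nn.ReLU(), nn.Linear(ASR_DIM, ASR_DIM)))
mt_encoder = freeze(nn.Sequential(nn.Linear(MT_DIM, MT_DIM), nn.ReLU(), nn.Linear(MT_DIM, MT_DIM)))

connector = Connector(ASR_DIM, MT_DIM)
# Only the connector's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(connector.parameters(), lr=1e-4)

features = torch.randn(2, 40, 80)  # (batch, frames, acoustic feature dim), random dummy input
with torch.no_grad():
    asr_states = asr_encoder(features)  # frozen ASR encoder output
mt_input = connector(asr_states)        # mapped into the MT embedding width
mt_states = mt_encoder(mt_input)        # frozen MT model consumes connector output

# Placeholder loss; an actual ST setup would use the MT decoder's translation loss.
loss = mt_states.pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Gradients still flow through the frozen MT stand-in back to the connector, but only the connector's weights are updated, which mirrors the "only part of the system optimized during training" description.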

