CL · SD · AS · Mar 13, 2025

Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

arXiv:2503.10211v1 · 3 citations · h-index: 10 · NLPCC
Originality: Incremental advance
AI Analysis

This work addresses cross-modal learning in LLM-based speech translation. It is an incremental advance that aligns speech and text at inner layers rather than only at the input and output.

The paper tackles the problem of modality gaps in speech translation by proposing an Adaptive Inner Speech-Text Alignment method to align speech and text representations within large language models, resulting in significant improvements in translation performance over previous state-of-the-art approaches.

Recent advances in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method that bridges the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we use cross-modal retrieval to identify the layers best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
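The two ingredients named in the abstract, an OT-based discrepancy and retrieval-based layer selection, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The cosine cost, the uniform token masses, the Sinkhorn solver, the top-k selection rule, and every function name here (`cosine_cost`, `sinkhorn_ot_cost`, `retrieval_accuracy`, `select_layers`) are assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def cosine_cost(u, v):
    """Cost between two non-zero vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sinkhorn_ot_cost(speech, text, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost between two sequences of vectors.

    Assumption of this sketch: every token carries uniform mass.
    """
    n, m = len(speech), len(text)
    cost = [[cosine_cost(s, t) for t in text] for s in speech]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # mass on speech tokens
    b = [1.0 / m] * m  # mass on text tokens
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v); the OT cost is the sum of P * C.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


def retrieval_accuracy(speech_vecs, text_vecs):
    """Fraction of speech vectors whose nearest text vector is their own pair.

    speech_vecs[i] and text_vecs[i] are assumed to be pooled representations
    of the same utterance taken at one layer.
    """
    hits = 0
    for i, s in enumerate(speech_vecs):
        best = min(range(len(text_vecs)), key=lambda j: cosine_cost(s, text_vecs[j]))
        hits += best == i
    return hits / len(speech_vecs)


def select_layers(per_layer_pairs, k=1):
    """Return the k layers with the highest retrieval accuracy.

    per_layer_pairs maps a layer index to (speech_vecs, text_vecs).
    Picking the top-k layers is an assumed criterion, not the paper's.
    """
    scores = {
        layer: retrieval_accuracy(speech_vecs, text_vecs)
        for layer, (speech_vecs, text_vecs) in per_layer_pairs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In a training setup along these lines, `sinkhorn_ot_cost` would be evaluated on the hidden states of the layers returned by `select_layers` and added to the translation loss as an auxiliary term. A differentiable tensor implementation would be needed for that; the list-based version above only shows the arithmetic.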

