CL · AI · AS · Mar 8, 2025

Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs

MIT
arXiv:2503.06211v2 · 2 citations · h-index: 21
AI Analysis

This paper addresses inefficient cross-modal learning in TSLMs. It is aimed at AI researchers and represents an incremental improvement over existing methods.

The paper tackles a limitation of early modality fusion/fission in Text-Speech Language Models (TSLMs) by proposing late fusion and multi-level fission to enhance cross-modal transfer. The resulting models rival or surpass state-of-the-art TSLMs trained with far more compute, and they achieve significantly better cross-modal performance than early fusion/fission baselines.

Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
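The late fusion and multi-level fission idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyLateFusionModel`, the helper names, the layer counts, the hidden size, and the softmax-weighted mix over backbone levels are all assumptions made for illustration, and each "layer" is a tiny residual map standing in for a transformer block. Speech passes through extra input layers before entering the shared backbone (late fusion), and the speech output path reads a mix of low- and high-level backbone states before its own output layers (multi-level fission).

```python
import math
import random

DIM = 4  # toy hidden size (assumption, chosen only to keep the example small)


def make_layer(rng):
    """Toy residual layer x + tanh(Wx); a stand-in for a transformer block."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]

    def layer(x):
        return [xi + math.tanh(sum(w * v for w, v in zip(row, x)))
                for xi, row in zip(x, W)]

    return layer


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyLateFusionModel:
    """Hypothetical sketch: speech-specific layers wrap a shared backbone."""

    def __init__(self, n_backbone=3, n_speech_in=2, n_speech_out=2, seed=0):
        rng = random.Random(seed)
        self.speech_in = [make_layer(rng) for _ in range(n_speech_in)]
        self.backbone = [make_layer(rng) for _ in range(n_backbone)]
        self.speech_out = [make_layer(rng) for _ in range(n_speech_out)]
        # One mixing logit per backbone level, including the backbone input.
        # Uniform initialisation is an assumption; in practice these would be learned.
        self.mix_logits = [0.0] * (n_backbone + 1)

    def backbone_states(self, x):
        """Return the backbone input plus the output of every backbone layer."""
        states = [x]
        for layer in self.backbone:
            states.append(layer(states[-1]))
        return states

    def forward_text(self, text_emb):
        """Text enters the backbone directly and is read from the top layer."""
        return self.backbone_states(text_emb)[-1]

    def forward_speech(self, speech_emb):
        # Late fusion: speech-specific layers run before the shared backbone.
        h = speech_emb
        for layer in self.speech_in:
            h = layer(h)
        states = self.backbone_states(h)
        # Multi-level fission: mix low- and high-level backbone states.
        weights = softmax(self.mix_logits)
        mixed = [sum(w * s[i] for w, s in zip(weights, states))
                 for i in range(DIM)]
        # Speech-specific output layers run after the mix.
        for layer in self.speech_out:
            mixed = layer(mixed)
        return mixed


model = ToyLateFusionModel()
speech_features = model.forward_speech([0.1, -0.2, 0.3, 0.0])
text_features = model.forward_text([0.1, -0.2, 0.3, 0.0])
```

In an early fusion/fission baseline, by contrast, `speech_in` and `speech_out` would be empty and speech would be predicted from the top backbone state alone, which is the setup the abstract argues limits cross-modal transfer.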
