Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
This work addresses limited cross-modal transfer in Text-Speech Language Models (TSLMs), offering an incremental improvement over existing training methods.
The paper addresses a limitation of early modality fusion/fission in Text-Speech Language Models (TSLMs) by proposing late fusion and multi-level fission to enhance cross-modal transfer. The resulting models rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly better cross-modal performance than early fusion/fission baselines.
Text-Speech Language Models (TSLMs), language models trained to jointly process and generate text and speech, are commonly trained through an early modality fusion/fission approach, in which both modalities are fed into and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality (specifically, the finer-grained nature of speech representations compared to text), preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
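To make the architectural contrast concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes residual tanh layers as stand-ins for transformer blocks and a softmax-weighted sum over backbone layers as one possible way to give fission access to both low- and high-level features. All sizes and names (D, N_SPEECH, speech_in, speech_out, mix_logits) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # hidden width (hypothetical)
N_SPEECH = 32   # speech-token vocabulary size (hypothetical)

def make_layers(n, d=D):
    """Return n random weight matrices standing in for transformer blocks."""
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(n)]

def run_layers(x, layers):
    """Apply each stand-in layer with a residual connection; return every hidden state."""
    states = []
    for w in layers:
        x = x + np.tanh(x @ w)
        states.append(x)
    return states

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

speech_in = make_layers(2)             # late fusion: speech-only input layers
backbone = make_layers(4)              # shared text-speech backbone
speech_out = make_layers(2)            # late fission: speech-only output layers
mix_logits = np.zeros(len(backbone))   # would be learned; uniform weights here (assumption)
head = rng.normal(scale=D ** -0.5, size=(D, N_SPEECH))

def speech_logits(speech_emb):
    # Late fusion: process fine-grained speech features before the shared backbone.
    fused = run_layers(speech_emb, speech_in)[-1]
    hidden = run_layers(fused, backbone)
    # Multi-level fission: combine low- and high-level backbone states.
    weights = softmax(mix_logits)
    mixed = sum(w * h for w, h in zip(weights, hidden))
    # Late fission: speech-only output layers, then a linear head over speech tokens.
    out = run_layers(mixed, speech_out)[-1]
    return out @ head

logits = speech_logits(rng.normal(size=(5, D)))  # 5 speech frames
print(logits.shape)  # (5, 32)
```

Under the same assumptions, an early fusion/fission baseline would correspond to dropping speech_in and speech_out and applying the linear head directly to the last backbone state.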