CL · Oct 2, 2025

Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

arXiv:2510.02569v2 · 2 citations · h-index: 6
AI Analysis

This paper addresses the limited understanding of how modality adapters work in spoken language models. The contribution is incremental, but it clarifies a key component for researchers and developers.

The study investigates how modality adapters in spoken language models transform speech encoder outputs into representations for the language model. It finds that adapters in models with a Whisper encoder represent the meaning of the input in an English-based interlingua, which lets them handle languages unseen in instruction tuning, while adapters in other models, such as Phi-4-Multimodal-Instruct, represent the phonetics of the input using English words.

Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that do not, such as Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which of the two strategies arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
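The nearest-token probe mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes cosine similarity as the distance measure (the abstract does not say which one is used), and the three-dimensional embedding table, the vocabulary, and the names `cosine`, `nearest_tokens`, `toy_embeddings` and `toy_ma_outputs` are all hypothetical stand-ins for a real decoder LM's input embeddings and a real adapter's outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_tokens(ma_outputs, embedding_table):
    """Map each adapter output vector to the token with the most similar embedding.

    Assumption: similarity is cosine; the paper may use a different measure.
    """
    decoded = []
    for vec in ma_outputs:
        best = max(embedding_table, key=lambda tok: cosine(vec, embedding_table[tok]))
        decoded.append(best)
    return decoded


# Hypothetical toy data: a 3-token vocabulary with 3-dimensional embeddings.
toy_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "house": [0.0, 0.0, 1.0],
}

# Hypothetical adapter outputs, one vector per speech frame.
toy_ma_outputs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
]

decoded = nearest_tokens(toy_ma_outputs, toy_embeddings)
print(decoded)  # ['dog', 'house']
```

Reading the decoded tokens in sequence is what would let one tell the two strategies apart: English words that match the meaning of non-English speech suggest an interlingua, while English words that merely sound like the input suggest a phonetic representation.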

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
