CL · Apr 2

Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models

Toshiki Nakai, Varsha Suresh, Vera Demberg

arXiv:2601.17387 · 88.0 · h-index: 2

Predicted impact: top 40% in CL · last 90 days · Originality: Incremental advance

AI Analysis

The paper provides a nuanced view of multilingual computation in speech-text models. It is rated incremental because it refines existing understanding without opening broad new applications.

The study examines cross-modal language alignment in multilingual speech-text models by introducing a generation-step-aware framework for evaluating shared computation. It finds that language-representation neurons are shared across modalities at early decoding steps but that this sharing weakens later, while language-control neurons transfer across modalities and their effect strengthens over time.

Multilingual speech-text models rely on cross-modal language alignment to transfer knowledge between speech and text, but it remains unclear whether this reflects shared computation for the same language or modality-specific processing. We introduce a generation-step-aware framework for evaluating cross-modal computation that (i) identifies language-selective neurons for each modality at different decoding steps, (ii) decomposes them into language-representation and language-control roles, and (iii) enables cross-modal comparison via overlap measures and causal intervention, including cross-modal steering of output language. Applying our framework to SeamlessM4T v2, we find that cross-modal language alignment is strongest at the first decoding step, where language-representation neurons are shared across modalities, but weakens as generation proceeds, indicating a shift toward modality-specific autoregressive processing. In contrast, language-control neurons identified from speech transfer causally to text generation, revealing partially shared circuitry for output-language control that strengthens at later decoding steps. These results show that cross-modal processing is both time- and function-dependent, providing a more nuanced view of multilingual computation in speech-text models.
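The abstract mentions cross-modal comparison via overlap measures but does not say which measure is used. As a minimal sketch only, assuming a Jaccard overlap between the sets of language-selective neuron indices found for speech and for text at each decoding step, the comparison could look like the following. The function names, the choice of Jaccard, and the toy neuron indices are all illustrative assumptions, not the authors' implementation or results.

```python
from typing import Dict, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap |a & b| / |a | b|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_by_step(
    speech: Dict[int, Set[int]], text: Dict[int, Set[int]]
) -> Dict[int, float]:
    """Overlap per decoding step, for steps present in both modalities."""
    shared_steps = sorted(speech.keys() & text.keys())
    return {step: jaccard(speech[step], text[step]) for step in shared_steps}


# Hypothetical neuron indices per decoding step (made up for illustration,
# not taken from the paper).
speech_neurons = {0: {1, 2, 3, 4}, 1: {1, 2, 7, 8}}
text_neurons = {0: {2, 3, 4, 5}, 1: {2, 9, 10, 11}}

scores = overlap_by_step(speech_neurons, text_neurons)
print(scores)
```

In this toy input the overlap is higher at step 0 than at step 1, which is the qualitative pattern the abstract reports for language-representation neurons; the numbers themselves carry no empirical meaning.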
