CL, May 26, 2025

Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank

arXiv:2505.19606v1

Originality: Incremental advance

AI Analysis

This work examines how cross-lingual alignment arises in multilingual speech foundation models, a question relevant to both researchers and practitioners. The contribution is incremental, since it builds on prior work on spoken translation retrieval.

The study investigates whether cross-lingual alignment in speech foundation models is semantic rather than purely phonetic. It finds that spoken translation retrieval accuracy remains relatively stable without phonetic cues and that the encoder holds both phonetic and semantic knowledge. Applying an insight from early exiting improved speech recognition accuracy in all seven low-resource languages examined, none of which Whisper supports, particularly for languages with transparent orthographies.

Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.
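The abstract does not spell out how spoken translation retrieval is scored. The following is a minimal sketch of one common setup, not the authors' implementation: it assumes each utterance is represented by mean-pooled encoder states and that a source utterance counts as correctly retrieved when its nearest target by cosine similarity is its own translation. The function names `mean_pool` and `retrieval_accuracy` and the synthetic data are hypothetical and stand in for real encoder outputs.

```python
import numpy as np


def mean_pool(frames):
    """Collapse a (time, dim) array of frame-level states into one vector."""
    return frames.mean(axis=0)


def retrieval_accuracy(src, tgt):
    """Top-1 retrieval accuracy, where src[i] and tgt[i] are translations.

    Both inputs are (n_utterances, dim) arrays of utterance embeddings.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src_n @ tgt_n.T
    # For each source utterance, pick the most similar target utterance.
    pred = sim.argmax(axis=1)
    return float((pred == np.arange(len(src))).mean())


# Synthetic stand-in for encoder outputs (an assumption for illustration):
# each "meaning" vector is shared by a source and a target utterance, which
# differ only in length and in small frame-level noise.
rng = np.random.default_rng(0)
meanings = rng.normal(size=(5, 16))
src = np.stack([mean_pool(m + 0.1 * rng.normal(size=(20, 16))) for m in meanings])
tgt = np.stack([mean_pool(m + 0.1 * rng.normal(size=(30, 16))) for m in meanings])

print(retrieval_accuracy(src, tgt))
```

In a real experiment the rows of `src` and `tgt` would come from a chosen encoder layer of the speech model, and the pronunciation-controlled condition described above would change only which utterances are paired, not the scoring.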

