CL SD AS May 30, 2025

Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando

arXiv:2505.24691v1 9.64 citations h-index: 9 INTERSPEECH

Originality: Incremental advance

AI Analysis

This work aims to make speech-to-text translation more accessible across diverse languages, particularly in low-resource scenarios. It is an incremental advance because it builds on existing multilingual LLM and CoT methods.

The paper tackles speech-to-text translation in low-resource and zero-resource settings by integrating phoneme representations into a Chain-of-Thought framework. This improves translation quality in low-resource conditions and enables zero-resource translation, at the cost of a slight drop in high-resource performance.

We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
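As a rough illustration of the idea in the abstract, the sketch below builds a text target in which phoneme recognition comes before the later steps, with a staged curriculum that adds one step at a time. This is a minimal sketch and not the authors' implementation: the tag format, the three-stage schedule, the presence of a transcription step, and the names `Example`, `STAGES`, and `build_target` are all assumptions made for illustration, and the phoneme string is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One training item (hypothetical layout, not taken from the paper)."""
    phonemes: str      # placeholder phoneme sequence for the utterance
    transcript: str    # source-language text
    translation: str   # target-language text


# Assumed curriculum: each stage appends one more step to the target,
# so the model meets simpler tasks before the full chain.
STAGES = {
    1: ["phonemes"],
    2: ["phonemes", "transcript"],
    3: ["phonemes", "transcript", "translation"],
}


def build_target(ex: Example, stage: int) -> str:
    """Return the chain-of-thought style target text for a curriculum stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown curriculum stage: {stage}")
    # One tagged line per step, in the order the chain is meant to be produced.
    return "\n".join(f"<{name}> {getattr(ex, name)}" for name in STAGES[stage])


example = Example(
    phonemes="b w e n o s d i a s",
    transcript="buenos días",
    translation="good morning",
)
full_chain = build_target(example, stage=3)
```

Here `build_target(example, stage=1)` yields only the phoneme line, while stage 3 yields the full chain ending in the translation. In a zero-resource setting, the intuition from the abstract is that the phoneme step gives the model a shared intermediate representation to transfer across languages.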
