CL, SD · Oct 3, 2025

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

arXiv:2510.03093v1 · 4.91 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work examines how prompting strategies for LLM-based speech translation scale with training data. It is an incremental advance, as it builds on existing LLM-based methods.

The paper tackles the problem of Speech-to-Text Translation (S2TT) by comparing Chain-of-Thought (CoT) and Direct prompting strategies, finding that Direct prompting improves more consistently with increasing data, suggesting it may become more effective as S2TT resources grow.

Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
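The difference between the two prompting strategies can be illustrated with a minimal sketch of how one pseudo-labeled sample might be turned into a training pair for each. The abstract does not give the actual prompt wording or output format, so the templates, the `<audio:...>` placeholder, the `Example` fields, and the choice of Spanish below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One pseudo-labeled S2TT sample (all field names are hypothetical)."""
    audio_id: str      # stand-in for the speech input
    transcript: str    # reference transcription from the ASR corpus
    translation: str   # pseudo-label: the transcription translated by a T2TT system
    target_lang: str


def direct_pair(ex: Example) -> tuple[str, str]:
    """Direct prompting: the target is the translation only."""
    prompt = f"<audio:{ex.audio_id}> Translate the speech into {ex.target_lang}."
    return prompt, ex.translation


def cot_pair(ex: Example) -> tuple[str, str]:
    """CoT prompting: the target contains the transcription, then the translation."""
    prompt = (
        f"<audio:{ex.audio_id}> Transcribe the speech, "
        f"then translate it into {ex.target_lang}."
    )
    target = f"Transcript: {ex.transcript}\nTranslation: {ex.translation}"
    return prompt, target


def final_translation(output: str) -> str:
    """Return the translation from either output style for scoring."""
    marker = "Translation: "
    return output.split(marker, 1)[1] if marker in output else output


# Illustrative sample; the text and language are made up for this sketch.
ex = Example("utt1", "hello world", "hola mundo", "Spanish")
direct_prompt, direct_target = direct_pair(ex)
cot_prompt, cot_target = cot_pair(ex)
```

In this sketch both strategies consume the same pseudo-labeled data; only the CoT target depends on the transcription, which is why CoT can also draw on ASR and T2TT data to model its intermediate steps.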
