CL, SD | Oct 3, 2025

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

arXiv:2510.03115v1 | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This incremental study addresses limitations of S2TT systems by revealing shortcomings in current CoT methods, with findings relevant to both researchers and practitioners.

The paper examines whether Chain-of-Thought prompting in Speech-to-Text Translation effectively leverages speech cues and finds that it relies largely on transcripts. It also shows that simple training interventions, such as adding Direct S2TT data or injecting noisy transcripts, enhance robustness and increase speech attribution.

Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.
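
As a rough illustration of what noisy transcript injection could look like when building CoT training examples, the sketch below corrupts the intermediate transcript while leaving the target translation untouched, so a model cannot rely on the transcript alone. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the word-level delete/substitute scheme, and the `noise_rate` parameter are all hypothetical, since this summary does not specify how the noise is generated.

```python
import random


def corrupt_transcript(transcript, noise_rate=0.3, vocab=None, seed=None):
    """Randomly delete or substitute words in a transcript.

    Hypothetical noise scheme for illustration only; the paper's actual
    corruption procedure is not described in this summary.
    """
    rng = random.Random(seed)
    words = transcript.split()
    # Substitutes are drawn from `vocab`, or from the transcript itself.
    candidates = vocab or words
    noisy = []
    for word in words:
        if rng.random() < noise_rate:
            if rng.choice(("delete", "substitute")) == "substitute":
                noisy.append(rng.choice(candidates))
            # "delete": the word is simply dropped.
        else:
            noisy.append(word)
    return " ".join(noisy)


def build_cot_example(audio_id, transcript, translation, noise_rate=0.3, seed=None):
    """Pair a (possibly corrupted) transcript with the clean translation."""
    return {
        "audio": audio_id,  # placeholder for the speech input
        "transcript": corrupt_transcript(transcript, noise_rate, seed=seed),
        "target": translation,  # the translation is never corrupted
    }
```

Setting `noise_rate=0.0` reproduces the clean CoT example, which makes it easy to mix clean and noisy examples in one training set.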

