SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
This work addresses real-time, high-quality speech translation for multilingual applications. It is an incremental advance that integrates syntactic parsing into existing LLM-based systems.
The paper tackled simultaneous speech translation by proposing a syntax-aware chunking strategy to segment input streams into coherent units, and introduced SASST, an end-to-end framework integrating Whisper and LLMs that dynamically outputs translations or wait symbols. Experiments on CoVoST2 showed significant translation quality improvements across multiple languages.
We propose a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method keeps each chunk coherent and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework that integrates a frozen Whisper encoder with a decoder-only LLM. The unified architecture dynamically outputs either translation tokens or <WAIT> symbols, jointly optimizing translation timing and content, while target-side reordering addresses word-order divergence. Experiments on the CoVoST2 multilingual corpus (En-{De, Zh, Ja}) demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
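To make the chunking idea concrete, the following is a minimal sketch of one way a dependency-based chunker could work. It is not the paper's algorithm: the `Token` structure, the `syntax_chunks` function, and the boundary rule (close a chunk at punctuation, or once no buffered token is still waiting for a head further to the right) are all assumptions chosen for illustration. The dependency parse is supplied by hand, whereas a real system would obtain it from a parser.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    text: str
    head: int    # index of the syntactic head; -1 marks the root
    deprel: str  # dependency label, e.g. "det", "nsubj", "obj"

PUNCT = {",", ".", "?", "!", ";", ":"}

def syntax_chunks(tokens: List[Token]) -> List[List[str]]:
    """Hypothetical heuristic: split a parsed token stream into chunks.

    A chunk is closed after token i when the token is punctuation, or when
    no token in the current buffer has its head to the right of i (so no
    dependent is cut off from a head that has not arrived yet).
    """
    chunks: List[List[str]] = []
    start = 0
    for i, tok in enumerate(tokens):
        buffer = tokens[start:i + 1]
        pending = any(t.head > i for t in buffer)
        if tok.text in PUNCT or not pending:
            words = [t.text for t in buffer]
            # Attach a punctuation-only chunk to the previous chunk.
            if chunks and all(w in PUNCT for w in words):
                chunks[-1].extend(words)
            else:
                chunks.append(words)
            start = i + 1
    if start < len(tokens):
        chunks.append([t.text for t in tokens[start:]])
    return chunks

# Hand-written parse of "The cat chased a mouse ."
sentence = [
    Token("The", 1, "det"),
    Token("cat", 2, "nsubj"),
    Token("chased", -1, "root"),
    Token("a", 4, "det"),
    Token("mouse", 2, "obj"),
    Token(".", 2, "punct"),
]

print(syntax_chunks(sentence))
# [['The', 'cat', 'chased'], ['a', 'mouse', '.']]
```

Under this rule the determiner "a" is never separated from "mouse", because the chunk stays open until the noun it depends on has arrived. In a simultaneous setting, a policy could emit <WAIT> while a chunk is still open and translate once it closes, although the abstract does not specify how SASST couples the two.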