From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach
This work addresses the recognition and downstream processing of synthesized speech for silent speech interfaces; its contribution is an incremental improvement within a specific domain.
The paper tackles phonetic ambiguity and noise in speech synthesized by silent speech interfaces. It proposes an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model for post-processing, and reports a 6% absolute (16% relative) reduction in word error rate over a 36% baseline.
Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
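The relationship between the absolute and relative figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: it assumes the standard word-level edit-distance definition of WER, and the 30% post-processing WER is derived here from the stated 36% baseline and 6% absolute reduction rather than reported directly. The function names `wer` and `wer_reduction` are illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction for WERs given as fractions."""
    absolute = baseline - improved
    relative = absolute / baseline
    return absolute, relative


# Assumption: improved WER = 0.36 - 0.06 = 0.30, derived from the abstract.
absolute, relative = wer_reduction(baseline=0.36, improved=0.30)
print(f"absolute: {absolute:.2%}, relative: {relative:.2%}")
```

Under these assumptions the relative reduction is 6/36, or about 16.7%, which is consistent with the 16% relative figure in the abstract.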