MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors
This work is significant for individuals with communication impairments, offering a promising step towards non-invasive speech brain-computer interfaces, and also contributes to auditory neuroscience research by providing a new tool to probe human auditory perception.
This paper addresses the challenge of reconstructing intelligible speech from noisy non-invasive neural signals (EEG and MEG). The authors introduce MindVoice, a framework that leverages pretrained models to recover both high-level semantic content and fine-grained acoustic attributes from neural recordings, leading to substantially improved speech intelligibility compared to existing methods.
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive because non-invasive recordings are inherently noisy, spatially blurred, and preserve only partial information about perceived speech. Existing methods map neural activity directly to entangled speech representations and then synthesize waveforms with neural vocoders, producing output that is spectrally similar to the target but unintelligible. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, marking a promising step for auditory neuroscience research and non-invasive speech brain-computer interfaces.
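The two-pathway idea in the abstract can be illustrated with a toy sketch. This is not the MindVoice implementation: the abstract does not specify the decoders, so plain ridge regression stands in for them here, and all names (`fit_ridge`, `decode_two_pathways`), shapes, and the synthetic data are hypothetical. The pretrained speech generator and in-context voice cloning stage is not modeled; the sketch stops at the two conditioning vectors that such a generator would presumably consume.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def decode_two_pathways(neural, W_sem, W_ac):
    """Map neural features to separate semantic and acoustic estimates."""
    return {"semantic": neural @ W_sem, "acoustic": neural @ W_ac}


# Synthetic stand-in data (assumed shapes, not taken from the paper).
rng = np.random.default_rng(0)
n_trials, n_neural, d_sem, d_ac = 200, 32, 8, 4
X = rng.standard_normal((n_trials, n_neural))        # EEG/MEG features
true_sem = rng.standard_normal((n_neural, d_sem))    # hidden semantic map
true_ac = rng.standard_normal((n_neural, d_ac))      # hidden acoustic map
Y_sem = X @ true_sem + 0.1 * rng.standard_normal((n_trials, d_sem))
Y_ac = X @ true_ac + 0.1 * rng.standard_normal((n_trials, d_ac))

# Each pathway gets its own decoder, trained on its own target space.
W_sem = fit_ridge(X, Y_sem)
W_ac = fit_ridge(X, Y_ac)

# The two estimates would jointly condition a pretrained generator
# (assumption: that stage is outside the scope of this sketch).
cond = decode_two_pathways(X[:5], W_sem, W_ac)
print(cond["semantic"].shape, cond["acoustic"].shape)
```

Keeping the two decoders separate means each target space can be chosen to suit its own pretrained model, rather than forcing one regression onto an entangled representation such as a spectrogram.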