SLM-SS: Speech Language Model for Generative Speech Separation
SLM-SS addresses poor speech intelligibility in separated signals, which hurts downstream tasks such as speech recognition. It is an incremental improvement over existing neural network-based methods.
The paper tackles speech separation by proposing SLM-SS, a method that uses speech language models to enhance the intelligibility and coherence of separated signals. On the LibriMix dataset, it preserves speech intelligibility significantly better than existing approaches.
Speech separation (SS) has advanced significantly with neural network-based methods, which show improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can degrade downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using Encoder-Decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing approaches, leading to improved linguistic consistency in a variety of downstream tasks.
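The multi-codebook framing and the split between autoregressive and non-autoregressive decoding can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the codec is a residual vector quantizer (suggested by the term "residual tokens" but not stated in the abstract), that the first codebook layer is decoded autoregressively while the remaining layers are each predicted in one parallel pass, and that the target has as many frames as the mixture. The names rvq_encode, generate_target_tokens, ar_step, and nar_layer are hypothetical, and the two "models" are stand-ins that simply copy the mixture tokens.

```python
import numpy as np


def rvq_encode(frames, codebooks):
    """Quantize frames of shape (T, D) with a stack of codebooks, each (K, D).

    Returns integer tokens of shape (num_codebooks, T).
    """
    residual = np.asarray(frames, dtype=np.float64).copy()
    tokens = []
    for codebook in codebooks:
        # Squared distance from every residual frame to every code vector.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        tokens.append(idx)
        # The next codebook quantizes whatever this one failed to explain.
        residual = residual - codebook[idx]
    return np.stack(tokens)


def generate_target_tokens(mixture_tokens, ar_step, nar_layer):
    """Assumed decoding schedule: AR for layer 0, NAR for residual layers."""
    num_codebooks, num_frames = mixture_tokens.shape
    first_layer = []
    for _ in range(num_frames):
        # One model call per frame, conditioned on the tokens emitted so far.
        first_layer.append(ar_step(mixture_tokens, first_layer))
    layers = [np.asarray(first_layer)]
    for q in range(1, num_codebooks):
        # One model call per residual layer, predicting all frames in parallel.
        layers.append(nar_layer(mixture_tokens, np.stack(layers), q))
    return np.stack(layers)


# Toy usage: random codebooks and stand-in "models" that copy the mixture.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]
mixture = rvq_encode(rng.normal(size=(10, 4)), codebooks)
calls = {"ar": 0, "nar": 0}


def ar_step(mix, prefix):
    calls["ar"] += 1
    return int(mix[0, len(prefix)])


def nar_layer(mix, previous_layers, q):
    calls["nar"] += 1
    return mix[q]


target = generate_target_tokens(mixture, ar_step, nar_layer)
```

In this toy setting with 3 codebooks and 10 frames, the schedule makes 10 autoregressive calls plus 2 parallel calls, whereas decoding every token one at a time would take 30 calls. That gap is the kind of decoding efficiency the non-autoregressive model is meant to provide under the assumptions above.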