SD · AS · May 15

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

arXiv:2506.13127 · 8.9

Predicted impact: top 61% in SD · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the need for efficient speech enhancement models by providing a distillation method that effectively transfers knowledge from complex teacher models to lightweight student models, which is important for real-world deployment.

The paper proposes a novel knowledge distillation framework (I²SRF-TFCKD) for speech enhancement that leverages intra-set and inter-set recursive fusion with time-frequency calibrated distillation. The method consistently improves the performance of low-complexity student models on both single-channel and multi-channel datasets, outperforming other distillation schemes.

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for speech enhancement (SE). Unlike previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. First, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. We then generate representative features from each correlated set through recursive fusion, forming a fused feature set that enables inter-set knowledge interaction. Second, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates teacher-student similarity calibration weights in the time and frequency domains separately and then cross-weights them, enabling refined allocation of distillation contributions across layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN), which ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed distillation strategy consistently improves the performance of the low-complexity student model and outperforms other distillation schemes.
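
The abstract does not give the formulas behind the dual-stream time-frequency cross-calibration, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes teacher and student features have already been projected to a common shape (layers, channels, frames, frequency bins), uses cosine similarity as the similarity measure, normalizes it across layers with a softmax, and cross-weights the two streams with an outer product. The function names (cross_calibrated_kd_loss and its helpers) and all of these choices are hypothetical.

```python
import numpy as np

def _softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def _cosine(a, b, axis, eps=1e-8):
    # Cosine similarity between a and b, reducing over `axis`.
    num = (a * b).sum(axis=axis)
    den = np.sqrt((a * a).sum(axis=axis)) * np.sqrt((b * b).sum(axis=axis))
    return num / (den + eps)

def cross_calibrated_kd_loss(teacher, student):
    """Hypothetical sketch of time-frequency cross-calibrated distillation.

    teacher, student: arrays of shape (L, C, T, F) holding L matched layers,
    C channels, T time frames, and F frequency bins (shape is an assumption).
    Returns (scalar loss, per-layer weights of shape (L, T, F)).
    """
    # Time stream: one similarity per layer and frame (reduce over C and F).
    sim_time = _cosine(teacher, student, axis=(1, 3))   # (L, T)
    # Frequency stream: one similarity per layer and bin (reduce over C and T).
    sim_freq = _cosine(teacher, student, axis=(1, 2))   # (L, F)

    # Calibration weights: normalize across layers (assumed softmax).
    w_time = _softmax(sim_time, axis=0)                 # (L, T)
    w_freq = _softmax(sim_freq, axis=0)                 # (L, F)

    # Cross-weighting: combine both streams per time-frequency cell,
    # then renormalize so the layer weights sum to 1 in every cell.
    w = w_time[:, :, None] * w_freq[:, None, :]         # (L, T, F)
    w = w / w.sum(axis=0, keepdims=True)

    # Per-layer squared error, averaged over channels.
    err = ((teacher - student) ** 2).mean(axis=1)       # (L, T, F)

    # Weighted sum over layers, averaged over the time-frequency plane.
    loss = (w * err).sum(axis=0).mean()
    return loss, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.standard_normal((3, 4, 10, 8))              # toy teacher features
    s = t + 0.1 * rng.standard_normal(t.shape)          # toy student features
    loss, w = cross_calibrated_kd_loss(t, s)
    print(loss, w.shape)
```

In this sketch, layers whose student features already resemble the teacher's in a given frame and frequency bin receive a larger share of the distillation loss for that cell. Whether the paper weights by similarity or by dissimilarity, and how it normalizes, cannot be determined from the abstract alone.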
