Monaural source enhancement maximizing source-to-distortion ratio via automatic differentiation
This work addresses source enhancement quality for audio processing applications, but it appears incremental: it modifies the objective function rather than introducing a fundamentally new approach.
The paper tackles monaural source enhancement by using the signal-to-distortion ratio (SDR) as the objective function instead of conventional similarity metrics such as the L1 norm, the L2 norm, or STOI, on the grounds that SDR better reflects noise reduction quality. The reported experimental results show that the proposed method outperforms conventional methods.
Recently, deep neural networks (DNNs) have achieved a breakthrough in monaural source enhancement. Through training on a large amount of data, a DNN estimates a mapping from mixed signals to clean signals. Training uses an objective function that numerically expresses the quality of the mapping learned by the DNN. In conventional methods, the L1 norm, the L2 norm, and the Itakura-Saito divergence are often used as objective functions. Recently, an objective function based on short-time objective intelligibility (STOI) has also been proposed. However, these functions only indicate the similarity between the clean signal and the signal estimated by the DNN. In other words, they do not express the quality of noise reduction or source enhancement. Motivated by this fact, this paper adopts the signal-to-distortion ratio (SDR) as the objective function. Since SDR is virtually a signal-to-noise ratio (SNR), maximizing SDR addresses the above problem. The experimental results show that the proposed method achieves better performance than the conventional methods.
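To make the contrast between a similarity measure and SDR concrete, the following is a minimal sketch in plain Python. It assumes the simplified, SNR-style definition SDR = 10 log10(||s||^2 / ||s - s_hat||^2), which may differ from the exact definition used in the paper. The names `sdr_db`, `neg_sdr_loss`, and `l2_loss`, the `eps` constant, and the toy signals are illustrative assumptions and do not come from the paper. In an actual training setup the same expression would presumably be written in a framework that supports automatic differentiation, so that the negative SDR can be minimized by gradient descent.

```python
import math


def sdr_db(clean, estimate, eps=1e-12):
    """SDR in dB under the assumed definition 10*log10(||s||^2 / ||s - s_hat||^2).

    eps is a small constant (an assumption of this sketch) that avoids
    division by zero and log(0) for silent or perfectly estimated signals.
    """
    if len(clean) != len(estimate):
        raise ValueError("clean and estimate must have the same length")
    signal_energy = sum(s * s for s in clean)
    distortion_energy = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))


def neg_sdr_loss(clean, estimate):
    """Loss to minimize: maximizing SDR is the same as minimizing its negative."""
    return -sdr_db(clean, estimate)


def l2_loss(clean, estimate):
    """Squared L2 distance, one of the conventional similarity objectives."""
    return sum((s - e) ** 2 for s, e in zip(clean, estimate))


# Toy data: the same additive error applied to a loud and a quiet clean signal.
loud = [math.sin(0.1 * n) for n in range(200)]
quiet = [0.1 * s for s in loud]
error = [0.05 * math.cos(0.37 * n) for n in range(200)]
loud_est = [s + d for s, d in zip(loud, error)]
quiet_est = [s + d for s, d in zip(quiet, error)]

print("L2 loud :", l2_loss(loud, loud_est))
print("L2 quiet:", l2_loss(quiet, quiet_est))
print("SDR loud  [dB]:", sdr_db(loud, loud_est))
print("SDR quiet [dB]:", sdr_db(quiet, quiet_est))
```

In this toy case the L2 loss is essentially identical for both signals because the error is the same, whereas the SDR of the quiet signal is about 20 dB lower because the error is measured relative to the energy of the clean signal. This is one way to read the claim that SDR behaves like an SNR rather than a plain distance.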