SD Sep 13, 2024
Biomimetic Frontend for Differentiable Audio Processing
Ruolan Leslie Famularo, Dmitry N. Zotkin, Shihab A. Shamma et al.
While models in audio and speech processing are becoming deeper and more end-to-end, they consequently require expensive training on large datasets and are often brittle. We build on a classical model of human hearing and make it differentiable, so that we can combine traditional explainable biomimetic signal processing approaches with deep-learning frameworks. This allows us to arrive at an expressive and explainable model that is easily trained on modest amounts of data. We apply this model to audio processing tasks, including classification and enhancement. Results show that our differentiable model surpasses black-box approaches in terms of computational efficiency and robustness, even with little training data. We also discuss other potential applications.
SD Feb 11, 2015
Gaussian Process Models for HRTF based Sound-Source Localization and Active-Learning
Yuancheng Luo, Dmitry N. Zotkin, Ramani Duraiswami
From a machine learning perspective, the human ability to localize sounds can be modeled as a non-parametric and non-linear regression problem between binaural spectral features of sound received at the ears (input) and their sound-source directions (output). The input features can be summarized in terms of the individual's head-related transfer functions (HRTFs) which measure the spectral response between the listener's eardrum and an external point in $3$D. Based on these viewpoints, two related problems are considered: how can one achieve an optimal sampling of measurements for training sound-source localization (SSL) models, and how can SSL models be used to infer the subject's HRTFs in listening tests. First, we develop a class of binaural SSL models based on Gaussian process regression and solve a \emph{forward selection} problem that finds a subset of input-output samples that best generalize to all SSL directions. Second, we use an \emph{active-learning} approach that updates an online SSL model for inferring the subject's SSL errors via headphones and a graphical user interface. Experiments show that only a small fraction of HRTFs are required for $5^{\circ}$ localization accuracy and that the learned HRTFs are localized closer to their intended directions than non-individualized HRTFs.
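As a rough illustration of the forward-selection idea described above, the sketch below greedily picks training samples for a Gaussian process regressor so that the fitted model predicts all available samples well. It is not the authors' method. The RBF kernel, the mean-squared-error selection criterion, the helper names (`rbf`, `gp_predict`, `forward_select`), and the one-dimensional toy data standing in for binaural features and source directions are all assumptions made for the example.

```python
import numpy as np

def rbf(A, B, length=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_predict(X_train, y_train, X_test, noise=1e-3):
    """Posterior mean of a zero-mean GP regressor (assumed RBF kernel)."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf(X_test, X_train) @ np.linalg.solve(K, y_train)

def forward_select(X, y, budget):
    """Greedily add the sample that most reduces error over ALL samples.

    The mean-squared-error criterion is an assumption for this sketch;
    the abstract does not state the criterion used in the paper.
    """
    chosen, remaining = [], list(range(len(X)))
    best_err = np.inf
    for _ in range(budget):
        best, best_err = None, np.inf
        for i in remaining:
            idx = chosen + [i]
            err = np.mean((gp_predict(X[idx], y[idx], X) - y) ** 2)
            if err < best_err:
                best, best_err = i, err
        chosen.append(best)
        remaining.remove(best)
    return chosen, best_err

# Hypothetical toy data: a 1-D "feature" mapped to a smooth target.
X = np.linspace(-3.0, 3.0, 40)[:, None]
y = np.sin(X[:, 0])

chosen, err = forward_select(X, y, budget=6)
print("selected indices:", chosen)
print("error over all samples:", err)
```

In this toy setting a handful of well-placed samples is enough for the regressor to cover the whole input range, which is the intuition behind needing only a small fraction of measurements.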
SD Feb 11, 2015
Sparse Head-Related Impulse Response for Efficient Direct Convolution
Yuancheng Luo, Dmitry N. Zotkin, Ramani Duraiswami
Head-related impulse responses (HRIRs) are subject-dependent and direction-dependent filters used in spatial audio synthesis. They describe the scattering response of the head, torso, and pinnae of the subject. We propose a structural factorization of the HRIRs into a product of non-negative and Toeplitz matrices; the factorization is based on a novel extension of a non-negative matrix factorization algorithm. As a result, the HRIR becomes expressible as a convolution between a direction-independent \emph{resonance} filter and a direction-dependent \emph{reflection} filter. Further, the reflection filter can be made \emph{sparse} with minimal HRIR distortion. The described factorization is shown to be applicable to the arbitrary source signal case and allows one to employ time-domain convolution at a computational cost lower than using convolution in the frequency domain.
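To make the cost argument concrete, the sketch below shows why a sparse reflection filter makes direct time-domain convolution cheap: each nonzero tap contributes one scaled, delayed copy of the signal, so the work grows with the number of taps rather than with the filter length. It does not implement the factorization itself. The filter values, the tap positions, and the helper name `sparse_convolve` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sparse_convolve(signal, taps, gains):
    """Direct convolution with a sparse filter given as (delay, gain) pairs.

    Cost is one multiply-add per output sample per nonzero tap.
    """
    signal = np.asarray(signal, dtype=float)
    out = np.zeros(len(signal) + max(taps))
    for d, g in zip(taps, gains):
        out[d:d + len(signal)] += g * signal
    return out

# Hypothetical toy filters (not taken from the paper).
resonance = np.array([1.0, 0.5, 0.25])   # direction-independent part
taps = [0, 7, 19]                        # delays of the nonzero reflection taps
gains = [1.0, 0.6, 0.3]                  # their amplitudes

# Dense form of the same reflection filter, used only for the reference check.
reflection = np.zeros(20)
reflection[taps] = gains
hrir = np.convolve(resonance, reflection)  # HRIR = resonance * reflection

rng = np.random.default_rng(0)
x = rng.standard_normal(256)               # arbitrary source signal

y_dense = np.convolve(x, hrir)             # reference: full HRIR
y_sparse = sparse_convolve(np.convolve(x, resonance), taps, gains)

print(np.allclose(y_dense, y_sparse))
```

Because convolution is associative, the signal can be filtered once with the short resonance filter and then with the sparse reflection filter for each direction, giving the same output as convolving with the full HRIR.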