Differentiable Time-Frequency Scattering on GPU
This work addresses the integration of biologically plausible audio models into the standard evaluation toolkit used by researchers in audio generation and perception. Its contribution is incremental, in that it improves upon existing methods.
The paper addresses the limitations of prior joint time-frequency scattering implementations with a differentiable, fast, and flexible Python implementation that supports multiple backends on CPU and GPU. It demonstrates the utility of this implementation in three applications: unsupervised manifold learning, supervised classification, and texture resynthesis.
Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable to both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
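To make the description of JTFS as a convolutional operator in the time-frequency domain more concrete, the following is a minimal NumPy sketch and not the implementation presented in the paper. It assumes Gaussian (Gabor-like) filters defined directly in the Fourier domain, circular boundary conditions along both axes, and global averaging in place of a low-pass filter. The function names (gabor_1d, scalogram, jtfs_like) and all parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def gabor_1d(n, xi, sigma):
    """Gaussian bump in the Fourier domain, centered at frequency xi.

    xi is in cycles per sample and may be negative; sigma is the bandwidth.
    This is a simplified stand-in for a wavelet (no periodization, no
    zero-mean correction).
    """
    freqs = np.fft.fftfreq(n)
    return np.exp(-0.5 * ((freqs - xi) / sigma) ** 2)


def scalogram(x, n_filters=32, q=8):
    """First-order wavelet modulus, shape (n_filters, len(x))."""
    n = len(x)
    x_hat = np.fft.fft(x)
    # Geometrically spaced center frequencies, q filters per octave.
    xis = 0.4 * 2.0 ** (-np.arange(n_filters) / q)
    rows = []
    for xi in xis:
        psi_hat = gabor_1d(n, xi, sigma=xi / q)  # constant-Q bandwidth
        rows.append(np.abs(np.fft.ifft(x_hat * psi_hat)))
    return np.stack(rows)


def jtfs_like(x, rates=(0.01, 0.02, 0.04), scales=(0.1, 0.2)):
    """Second-order joint coefficients, indexed by (rate, scale, spin).

    rate:  temporal modulation frequency, in cycles per sample.
    scale: modulation frequency along the log-frequency axis,
           in cycles per filter index.
    spin:  +1 or -1, the sign of the frequential center frequency, which
           distinguishes the two orientations in the time-frequency plane.
    """
    u1 = scalogram(x)
    u1_hat = np.fft.fft2(u1)
    n_lambda, n_t = u1.shape
    coeffs = {}
    for rate in rates:
        psi_t = gabor_1d(n_t, rate, rate / 2)
        for scale in scales:
            for spin in (+1, -1):
                psi_f = gabor_1d(n_lambda, spin * scale, scale / 2)
                # Separable 2-D filter over (log-frequency, time).
                psi_2d = np.outer(psi_f, psi_t)
                u2 = np.abs(np.fft.ifft2(u1_hat * psi_2d))
                # Global average as a crude substitute for low-pass filtering.
                coeffs[(rate, scale, spin)] = float(u2.mean())
    return coeffs


if __name__ == "__main__":
    t = np.arange(2048) / 2048.0
    chirp = np.sin(2 * np.pi * (50.0 * t + 200.0 * t ** 2))
    for key, value in jtfs_like(chirp).items():
        print(key, value)
```

Every step in this sketch is a composition of FFTs, pointwise products, and complex moduli. The same structure could in principle be written against an automatic differentiation backend, which is what the differentiability requirement refers to.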