SD, AI, LG, MM, AS | Mar 18, 2023

Content Adaptive Front End For Audio Classification

arXiv:2303.10446v3
h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses the need for more flexible audio processing methods in machine learning, though it appears incremental as it builds on existing learnable front-end approaches.

The authors tackled the problem of creating a learnable front end for audio classification by proposing a content-adaptive time-frequency representation, which improved performance on tasks like acoustic scene classification and audio tagging compared to fixed front ends.

We propose a learnable, content-adaptive front end for audio signal processing. Before the advent of deep learning, fixed, non-learnable front ends such as the spectrogram or mel-spectrogram were used, with or without neural architectures. As convolutional architectures came to support applications such as ASR and acoustic scene understanding, a shift to learnable front ends occurred, in which both the type of basis functions and their weights were learned from scratch and optimized for the particular task of interest. With the shift to transformer-based architectures with no convolutional blocks, a linear layer projects small waveform patches onto a small latent dimension before feeding them to a transformer. In this work, we propose a way of computing a content-adaptive, learnable time-frequency representation. We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector. This is akin to learning a bank of finite impulse response (FIR) filter banks and passing the input signal through the optimal filter bank depending on its content. A content-adaptive learnable time-frequency representation may be more broadly applicable, beyond the experiments in this paper.
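
The idea of routing each input through one of several filter banks according to its content can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the bank is chosen, so the router here (a softmax over a linear score of the rectified patch) is an assumption, as are all names (`content_adaptive_front_end`, `banks`, `router_w`) and shapes. Each filter is applied as an inner product with a waveform patch, i.e. one FIR output sample per patch, and in a real system the filter and router weights would be learned rather than fixed.

```python
import numpy as np

def frame_signal(x, patch_len):
    """Split a 1-D waveform into non-overlapping patches of length patch_len."""
    n_patches = len(x) // patch_len
    return x[: n_patches * patch_len].reshape(n_patches, patch_len)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def content_adaptive_front_end(x, banks, router_w, hard=True):
    """
    x        : (T,) waveform
    banks    : (K, F, L) K candidate filter banks, each holding F FIR filters of length L
    router_w : (L, K) hypothetical router weights (an assumption, not from the paper)
    Returns (features, choice): features has shape (N, F), choice has shape (N,).
    """
    patch_len = banks.shape[-1]
    patches = frame_signal(x, patch_len)                   # (N, L)
    # Response of every filter in every bank to every patch: (N, K, F)
    responses = np.einsum("nl,kfl->nkf", patches, banks)
    # Content-dependent score per bank, computed from the rectified patch
    gate = softmax(np.abs(patches) @ router_w)             # (N, K)
    choice = gate.argmax(axis=-1)                          # (N,)
    if hard:
        # Keep only the output of the highest-scoring bank for each patch
        features = responses[np.arange(len(patches)), choice]
    else:
        # Soft alternative: gate-weighted mixture of all banks
        features = np.einsum("nk,nkf->nf", gate, responses)
    return features, choice
```

With `hard=True` each patch sees exactly one filter bank, which matches the "optimal filter bank" reading of the abstract; `hard=False` gives a differentiable mixture, which is one common way such a selection could be trained end to end.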

