ASCA: Less Audio Data Is More Insightful
The work addresses audio recognition in specialized areas such as birdsong and submarine acoustics, where data scarcity is the bottleneck. The contribution appears incremental, since it builds on existing hybrid architectures.
The paper tackles audio recognition in specialized domains with limited data by introducing the Audio Spectrogram Convolution Attention (ASCA) model, which achieved accuracies of 81.2% on BirdCLEF2023 and 35.1% on AudioSet (Balanced), significantly outperforming competing methods.
Audio recognition in specialized areas such as birdsong and submarine acoustics is hard to support with large-scale pre-training, because sampling environments and specificity requirements limit the number of available samples. While the Transformer model excels in audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. To address this, we introduce the Audio Spectrogram Convolution Attention (ASCA) model based on CoAtNet, which integrates a Transformer-convolution hybrid architecture, novel network design, and attention techniques, further augmented with data enhancement and regularization strategies. On BirdCLEF2023 and AudioSet (Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches its output, enabling generalization across various audio detection tasks. Our code can be found at https://github.com/LeeCiang/ASCA.
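To make the idea of a Transformer-convolution hybrid concrete, the following is a minimal pure-Python sketch of the general pattern: a strided convolution stage extracts local patterns from a spectrogram, and a self-attention stage then mixes information across time steps. It is an illustration under stated assumptions, not the ASCA implementation. The function names (conv2d_stride, self_attention, hybrid_forward), the toy 8x8 input, the single 2x2 kernel, and the identity query/key/value projections are all hypothetical simplifications. CoAtNet itself uses stacked convolution blocks and relative attention, neither of which is reproduced here.

```python
import math


def conv2d_stride(x, kernel, stride):
    """Valid 2-D cross-correlation of a single-channel map, followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(x) - kh + 1, stride):
        row = []
        for j in range(0, len(x[0]) - kw + 1, stride):
            s = sum(x[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(s, 0.0))  # ReLU non-linearity
        out.append(row)
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), to keep the sketch free of learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def hybrid_forward(spec, kernel):
    """Convolution stage, then attention stage, then mean pooling.

    spec is indexed as spec[frequency][time].
    """
    fmap = conv2d_stride(spec, kernel, stride=2)   # local patterns + downsampling
    tokens = [list(col) for col in zip(*fmap)]     # one token per time step
    attended = self_attention(tokens)              # global mixing across time
    n = len(attended)
    # Mean-pool over time steps to get one clip-level embedding.
    return [sum(tok[c] for tok in attended) / n
            for c in range(len(attended[0]))]


# Toy 8x8 "spectrogram" and a 2x2 averaging kernel (both hypothetical).
spec = [[float((i * j) % 5) for j in range(8)] for i in range(8)]
kernel = [[0.25, 0.25], [0.25, 0.25]]
embedding = hybrid_forward(spec, kernel)
```

The convolution stage halves each spectrogram dimension, so the attention stage operates on 4 tokens of 4 features rather than on 8 raw frames. This is the general motivation for placing convolution before attention when data is limited: the convolution supplies a local inductive bias and shortens the sequence that attention must model.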