SD | AI | Aug 13, 2025

No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings

arXiv:2508.10230v1 | 1 citation | h-index: 5 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the unreliability of embeddings taken from audio-pretrained models in bioacoustics, and offers incremental improvements to benchmarking and fine-tuning practices.

This study benchmarks 11 deep learning models on bioacoustic tasks, finding that audio-pretrained models without fine-tuning underperform fine-tuned AlexNet and fail to separate background from labeled sounds, while ResNet succeeds, highlighting the need for fine-tuning and embedding checks.

Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become a popular way to obtain bioacoustic features for downstream tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing the dimensionality of their learned embeddings and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning underperform even fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, whereas ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and of checking the embeddings after fine-tuning. Our code is available at https://github.com/NeuroscienceAI/Audio_Embeddings
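The evaluation idea in the abstract (reduce the dimensionality of the embeddings, then check how well clustering recovers the labels, including the background class) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the reduction method, clustering algorithm, or score, so PCA, k-means with farthest-point initialization, and cluster purity are stand-ins chosen for illustration, and the "embeddings" are synthetic Gaussian blobs rather than model outputs. All function names are hypothetical.

```python
import numpy as np

def pca_reduce(embeddings, n_components=2):
    """Project embeddings onto their top principal components (PCA via SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def farthest_point_init(points, k):
    """Deterministic seeding: start at point 0, then repeatedly add the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    return np.array(centers, dtype=float)

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return assign

def purity(assign, labels):
    """Fraction of points whose cluster's majority label matches their own."""
    total = 0
    for c in np.unique(assign):
        _, counts = np.unique(labels[assign == c], return_counts=True)
        total += counts.max()
    return total / len(labels)

def demo():
    # Synthetic stand-in for embeddings: two labeled call types plus a
    # "background" class (label 2), 100 samples each, 16 dimensions.
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(3), 100)
    means = np.zeros((3, 16))
    means[0, 0] = means[1, 1] = means[2, 2] = 20.0
    embeddings = means[labels] + rng.normal(size=(300, 16))

    reduced = pca_reduce(embeddings, n_components=2)
    assign = kmeans(reduced, k=3)
    return purity(assign, labels)

if __name__ == "__main__":
    print(f"cluster purity: {demo():.3f}")
```

With real embeddings, a low score for the background class in such a check would correspond to the failure mode the abstract describes, where background and labeled sounds are not separated in embedding space.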

