Multi-layer attentive probing improves transfer of audio representations for bioacoustics
For researchers evaluating audio representation learning in bioacoustics, this work shows that probe design can bias benchmark results and argues for more informative evaluation protocols.
The paper shows that replacing the standard last-layer linear probe with multi-layer probing improves downstream task performance on two bioacoustic benchmarks (BEANs and BirdSet) across all tested models, and that attention probes outperform linear probes for transformer models, suggesting that current benchmarks may misrepresent encoder quality.
Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
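To make the contrast between the two probing setups concrete, the following is a minimal sketch in plain Python, not the paper's implementation. The abstract does not specify the probe architectures, so everything here is an assumption for illustration: the function names, the mean pooling in the baseline, the softmax weighting over layers, and the single learned query vector used for attention pooling over time. Encoder outputs are assumed to arrive as one `[time][dim]` feature matrix per layer, all with the same shape.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear(x, weight, bias):
    # weight: [num_classes][dim], bias: [num_classes]
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def last_layer_linear_probe(layers, weight, bias):
    """Baseline (assumed form): mean-pool the final layer over time, then classify."""
    last = layers[-1]  # [time][dim]
    n_frames, dim = len(last), len(last[0])
    pooled = [sum(frame[d] for frame in last) / n_frames for d in range(dim)]
    return linear(pooled, weight, bias)


def multi_layer_attention_probe(layers, layer_logits, query, weight, bias):
    """Hypothetical probe: softmax-weighted mix of layers, attention pooling over time."""
    layer_w = softmax(layer_logits)  # one weight per encoder layer
    n_frames, dim = len(layers[0]), len(layers[0][0])
    mixed = [
        [sum(w * layer[t][d] for w, layer in zip(layer_w, layers)) for d in range(dim)]
        for t in range(n_frames)
    ]
    # Score each time frame against the query, then pool with the attention weights.
    attn = softmax([dot(query, frame) / math.sqrt(dim) for frame in mixed])
    pooled = [sum(a * frame[d] for a, frame in zip(attn, mixed)) for d in range(dim)]
    return linear(pooled, weight, bias), layer_w, attn


if __name__ == "__main__":
    random.seed(0)
    # Toy stand-in for encoder outputs: 3 layers, 5 time frames, 4 feature dims.
    feats = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(5)] for _ in range(3)]
    W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
    b = [0.0, 0.0]
    print(last_layer_linear_probe(feats, W, b))
    print(multi_layer_attention_probe(feats, [0.0, 0.0, 0.0], [0.1, 0.2, 0.3, 0.4], W, b)[0])
```

In this sketch the last-layer probe is a special case of the multi-layer one: putting all layer weight on the final layer and using a zero query (uniform attention over time) recovers the mean-pooled baseline. In a real evaluation the layer logits, query, and classifier weights would be trained on the downstream labels while the encoder stays frozen.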