BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics
BIRB offers a more realistic benchmark for evaluating ML generalization in bioacoustics, though the contribution is incremental: it adapts existing methods to a new domain-specific problem.
The authors address the lack of realistic generalization benchmarks in machine learning by creating BIRB, a complex benchmark for bird vocalization retrieval that uses citizen science data for training and passive recordings to stand in for deployment conditions. Their baseline system achieves competitive performance on the new benchmark.
The ability of a machine learning model to cope with differences between training and deployment conditions (e.g., under distribution shift or when generalizing to entirely new classes) is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain, with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively recorded datasets given focal recordings from a large citizen science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search. Our thorough empirical evaluation and analysis surface open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shifts and generalization of ML models.
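The nearest-centroid search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or the embedding model, so cosine similarity, the toy 2-D embeddings, and the names `centroid`, `cosine`, and `rank_by_centroid` are all assumptions made for illustration. The idea is that a few exemplar embeddings of a target species are averaged into one query vector, and every clip in the search corpus is ranked by its similarity to that vector.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_centroid(
    exemplars: List[Vector], corpus: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Rank corpus clips by similarity to the mean exemplar embedding.

    Assumption: the choice of cosine similarity is illustrative only.
    """
    query = centroid(exemplars)
    scored = [(clip_id, cosine(query, emb)) for clip_id, emb in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): in practice these would be embeddings produced
# by a learned audio representation, not hand-written 2-D vectors.
exemplars = [[1.0, 0.0], [0.8, 0.2]]
corpus = {
    "clip_a": [0.9, 0.1],
    "clip_b": [0.0, 1.0],
    "clip_c": [-1.0, 0.0],
}
ranking = rank_by_centroid(exemplars, corpus)
for clip_id, score in ranking:
    print(f"{clip_id}: {score:.3f}")
```

Averaging the exemplars keeps the query cost independent of how many focal recordings are available, at the price of collapsing any multi-modal structure in a species' vocalizations into a single point.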