Deep Networks tag the location of bird vocalisations on audio spectrograms
This work addresses the need for automated acoustic monitoring of bird species in conservation and research, though it appears incremental as it adapts existing deep learning techniques to this domain-specific task.
The paper tackles the problem of detecting and segmenting bird vocalizations in field recordings by proposing two deep learning approaches: using DenseNets with attention maps and YOLO v2 for localization, and a U-net autoencoder to generate binary masks for spectral blobs. The methods aim to automate analysis of large audio datasets with minimal human intervention, potentially aiding biodiversity monitoring and policy-making.
This work focuses on the reliable detection and segmentation of bird vocalizations recorded in the open field. Acoustic detection of avian sounds can be used for automated monitoring of multiple bird taxa and for querying long-term recordings for species of interest. We tackle these tasks with two approaches: A) First, DenseNets are applied to weakly labeled data to infer attention maps over the dataset (i.e. saliency maps and class activation maps, CAM). We extend this idea by passing the attention maps to YOLO v2, a deep-network-based detection framework, to localize bird vocalizations. B) A deep autoencoder, namely the U-net, maps the audio spectrogram of bird vocalizations to its corresponding binary mask, which encircles the spectral blobs of vocalizations while suppressing other audio sources. We focus solely on procedures that require minimal human intervention and are therefore suitable for scanning massive volumes of data in order to analyze them, evaluate insights and hypotheses, and identify patterns of bird activity. We hope this approach will be valuable to researchers, conservation practitioners, and decision makers who need to design policies on biodiversity issues.
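To make the attention-map idea in approach A concrete, the following is a minimal sketch of the standard class activation map computation, followed by a threshold that turns the map into a time-frequency box. It is not the authors' implementation: it assumes a classifier that ends in global average pooling followed by a linear layer, and the names `class_activation_map`, `bounding_box`, the 0.5 threshold, and the toy arrays are all hypothetical choices made for illustration.

```python
import numpy as np


def class_activation_map(features, class_weights):
    """Weighted sum of the last convolutional feature maps.

    features:      array of shape (K, H, W), K feature maps over a
                   spectrogram with H frequency bins and W time frames.
    class_weights: array of shape (K,), the linear-layer weights of the
                   target class (assumed to follow global average pooling).
    Returns an (H, W) map rescaled to the range [0, 1].
    """
    cam = np.tensordot(class_weights, features, axes=([0], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def bounding_box(cam, threshold=0.5):
    """Smallest box enclosing all cells at or above the threshold.

    Returns (t_min, t_max, f_min, f_max) as inclusive indices,
    or None when no cell reaches the threshold.
    """
    rows, cols = np.nonzero(cam >= threshold)
    if rows.size == 0:
        return None
    return int(cols.min()), int(cols.max()), int(rows.min()), int(rows.max())


# Toy data (hypothetical): two feature maps over an 8 x 10 spectrogram.
features = np.zeros((2, 8, 10))
features[0, 2:5, 3:7] = 1.0  # channel 0 responds to a vocalization blob
features[1, 6:8, 0:2] = 1.0  # channel 1 responds to an unrelated region
weights = np.array([1.0, 0.2])  # the target class relies mostly on channel 0

cam = class_activation_map(features, weights)
box = bounding_box(cam)
print(box)
```

In this toy case the weakly weighted region falls below the threshold, so the box encloses only the blob that drives the class score. A box of this kind is the sort of localization cue that could be handed to a detector such as YOLO v2, although the exact hand-off used in the paper is not specified in this abstract.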