Towards Leveraging Sequential Structure in Animal Vocalizations
This work addresses the need for feature representations in bioacoustics that retain temporal information, but the contribution is incremental: it adapts existing speech-processing methods to animal vocalizations.
The paper tackles the problem of capturing sequential structure in animal vocalizations, which computational bioacoustics often discards, by using discrete acoustic token sequences derived from self-supervised speech models. Results show that these sequences can discriminate call-types and callers across four datasets, with reasonable classification performance using k-Nearest Neighbour with Levenshtein distance.
Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-softmax vector quantization of extracted self-supervised speech model representations, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performance, and hold promise as alternative feature representations for leveraging sequential information in animal vocalizations.
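To make the classification step concrete, the following is a minimal sketch of k-Nearest Neighbour over token sequences with Levenshtein (edit) distance. It is not the authors' implementation: the function names `levenshtein` and `knn_predict`, the toy token IDs, the labels "trill" and "bark", and the choice of k = 3 with a simple majority vote are all assumptions made for illustration. In the paper, the token sequences would come from quantized HuBERT representations rather than hand-written lists.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two sequences; insert, delete, substitute cost 1 each."""
    # Keep the shorter sequence as the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def knn_predict(query: Sequence[int],
                train: List[Tuple[Sequence[int], str]],
                k: int = 3) -> str:
    """Majority label among the k training sequences closest to the query."""
    ranked = sorted((levenshtein(query, seq), label) for seq, label in train)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical token sequences (codebook indices) with made-up call-type labels.
train = [
    ([5, 5, 12, 12, 7], "trill"),
    ([5, 12, 12, 7, 7], "trill"),
    ([5, 5, 12, 7], "trill"),
    ([3, 3, 9, 9, 9, 1], "bark"),
    ([3, 9, 9, 1, 1], "bark"),
    ([3, 3, 9, 1], "bark"),
]
query = [5, 12, 12, 12, 7]

print(knn_predict(query, train, k=3))  # prints "trill"
```

Because edit distance compares sequences of different lengths directly, this approach needs no temporal averaging or padding, which is what lets the token order contribute to the decision.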