On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
This work examines how useful speech and audio foundation models are for marmoset call analysis. The contribution is incremental: it builds on existing methods used in neuro-biological research.
This study evaluates speech and audio foundation models for analyzing marmoset calls. It finds that models pre-trained at higher bandwidths perform better, and that pre-training on speech or on general audio yields comparable results, both improving over a spectral baseline.
Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Their calls have traditionally been analyzed with signal processing-based features, whereas recent approaches use self-supervised models pre-trained on human speech for feature extraction, capitalizing on the models' ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models for marmoset call analysis remains unclear in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from models pre-trained on speech and on general audio, across pre-training bandwidths of 4, 8, and 16 kHz, for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and that pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
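To make the bandwidth comparison concrete, the following minimal sketch relates each pre-training bandwidth to the minimum sampling rate it implies under the Nyquist criterion, and checks whether a given frequency component falls inside that band. It assumes that "bandwidth" here means the highest representable frequency (half the sampling rate), which the abstract does not state explicitly. The function names and the 7 kHz example component are hypothetical and chosen only for illustration; they are not taken from the study.

```python
# Illustrative sketch only: assumes "bandwidth" means the Nyquist frequency,
# i.e. half of the sampling rate of the audio a model was pre-trained on.

PRETRAINING_BANDWIDTHS_HZ = (4_000, 8_000, 16_000)  # values from the abstract


def min_sampling_rate_hz(bandwidth_hz: float) -> float:
    """Return the minimum sampling rate needed to represent content up to bandwidth_hz."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth_hz must be positive")
    return 2.0 * bandwidth_hz


def is_representable(component_hz: float, bandwidth_hz: float) -> bool:
    """Return True if a frequency component lies strictly below the band limit."""
    return 0 <= component_hz < bandwidth_hz


# Hypothetical frequency component, NOT a measured marmoset value.
HYPOTHETICAL_COMPONENT_HZ = 7_000

for bw in PRETRAINING_BANDWIDTHS_HZ:
    rate = min_sampling_rate_hz(bw)
    kept = is_representable(HYPOTHETICAL_COMPONENT_HZ, bw)
    print(f"bandwidth={bw} Hz -> sampling rate >= {rate:.0f} Hz, "
          f"{HYPOTHETICAL_COMPONENT_HZ} Hz component kept: {kept}")
```

Under this assumption, a model limited to a 4 kHz bandwidth cannot see any content above 4 kHz, which is one plausible reason a higher-bandwidth model could extract more informative features from calls with high-frequency energy.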