LG | SD | AS | May 23, 2023

Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?

arXiv:2305.14035v3 | 20 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses cross-domain transferability in bio-acoustics. It is incremental in that it applies existing methods to new data.

The paper investigated whether self-supervised learning models pre-trained on human speech can distinguish individual Marmoset callers without fine-tuning, and found that the embedding spaces successfully carried meaningful caller information.

Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field.
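The caller discrimination idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the frame embeddings are synthetic stand-ins for the output of a frozen SSL model, and the mean pooling and nearest-centroid classifier are simple choices for this example, not necessarily the analysis used in the paper. The helper names (`pool_utterance`, `nearest_centroid_accuracy`, `make_synthetic_calls`) are hypothetical.

```python
import numpy as np


def pool_utterance(frames: np.ndarray) -> np.ndarray:
    """Mean-pool a (num_frames, dim) array of frame embeddings into one vector."""
    return frames.mean(axis=0)


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y) -> float:
    """Classify each test vector by its closest per-caller mean embedding."""
    callers = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in callers])
    # Euclidean distance from every test vector to every caller centroid.
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    predicted = callers[dists.argmin(axis=1)]
    return float((predicted == test_y).mean())


def make_synthetic_calls(rng, num_callers=4, calls_per_caller=40, dim=16, spread=0.5):
    """Hypothetical stand-in for frozen SSL features: one cluster per caller."""
    xs, ys = [], []
    for caller in range(num_callers):
        center = rng.normal(size=dim) * 3.0
        for _ in range(calls_per_caller):
            num_frames = int(rng.integers(20, 60))  # calls vary in length
            frames = center + rng.normal(scale=spread, size=(num_frames, dim))
            xs.append(pool_utterance(frames))
            ys.append(caller)
    return np.stack(xs), np.array(ys)


rng = np.random.default_rng(0)
x, y = make_synthetic_calls(rng)

# Random half/half split; no fine-tuning step, only statistics on fixed embeddings.
order = rng.permutation(len(y))
split = len(y) // 2
train, test = order[:split], order[split:]
accuracy = nearest_centroid_accuracy(x[train], y[train], x[test], y[test])
print(f"caller identification accuracy: {accuracy:.2f}")
```

With real data, the synthetic generator would be replaced by frame-level features extracted from each vocalization by a pre-trained speech model kept frozen. The rest of the sketch would stay the same, since it only assumes one fixed-size vector per call and a caller label.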

Code Implementations: 1 repo
