Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?
This work addresses a gap in understanding that matters for applications such as music information retrieval and text-guided music generation, but its contribution is incremental: it evaluates existing models rather than proposing new ones.
The paper evaluates whether joint language-audio embedding models capture human-perceived timbre semantics, and finds that LAION-CLAP consistently provides the most reliable alignment across both instrumental sounds and audio effects.
Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks are joint language-audio embedding models, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate these three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.
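As an illustration only, one common way to probe whether a shared embedding space reflects a perceptual descriptor is to compare text-audio cosine similarities against human ratings using a rank correlation. The abstract does not specify the evaluation protocol, so the sketch below is an assumption rather than the method used here. The function names (`cosine_similarity`, `spearman`, `descriptor_alignment`) are hypothetical, and the embeddings are expected to come from whichever model is under test.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

def descriptor_alignment(text_embedding, audio_embeddings, human_ratings):
    """Rank correlation between model similarity and human ratings
    for one descriptor (hypothetical helper, not from the paper)."""
    similarities = [cosine_similarity(text_embedding, a) for a in audio_embeddings]
    return spearman(similarities, human_ratings)

# Toy placeholder vectors and ratings, NOT real model outputs or study data.
bright_text = [1.0, 0.0]
sounds = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
bright_ratings = [0.9, 0.5, 0.1]
score = descriptor_alignment(bright_text, sounds, bright_ratings)
```

With these toy values the similarities and ratings are ordered identically, so `score` is close to 1. In a real comparison, a higher correlation for a given descriptor would suggest closer agreement between the model and listeners on that dimension.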