WavLink: Compact Audio-Text Embeddings with a Global Whisper Token

Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid

arXiv:2601.15118v22.2 h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient audio-text embedding models for retrieval and classification tasks, though it is incremental as it builds on existing Whisper and CLAP-based methods.

The paper tackles the problem of creating compact audio-text embeddings by augmenting the Whisper encoder with a learnable global token, achieving state-of-the-art retrieval performance and enabling 8x smaller embeddings with minimal performance drop.

Whisper has become the de facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.
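To make the Matryoshka-style supervision concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which loss the final model uses, so the symmetric InfoNCE (CLIP-style) loss, the temperature of 0.07, the 16-dimensional toy embeddings, and all function names are assumptions chosen for illustration. The idea shown is that the same contrastive loss is applied to nested prefixes of each embedding, so a prefix that is 8x shorter (here 2 of 16 dimensions) can still be used for retrieval after re-normalization.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate(vec, dim):
    """Keep the first `dim` coordinates and re-normalize (a Matryoshka prefix)."""
    return l2_normalize(vec[:dim])


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(audio, text, temperature=0.07):
    """Symmetric contrastive loss where audio[i] is paired with text[i].

    Assumption: a CLIP-style loss; the paper compares several loss functions.
    """
    n = len(audio)
    logits = [[dot(a, t) / temperature for t in text] for a in audio]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    audio_to_text = cross_entropy_on_diagonal(logits)
    text_to_audio = cross_entropy_on_diagonal(columns)
    return 0.5 * (audio_to_text + text_to_audio)


def matryoshka_loss(audio, text, dims, temperature=0.07):
    """Average the contrastive loss over nested embedding prefixes."""
    losses = []
    for d in dims:
        a = [truncate(v, d) for v in audio]
        t = [truncate(v, d) for v in text]
        losses.append(info_nce(a, t, temperature))
    return sum(losses) / len(losses)


# Toy batch: 4 hypothetical 16-dimensional audio embeddings and matching
# text embeddings (identical here, i.e. a perfectly aligned model).
toy_audio = [[math.sin((i + 1) * (j + 1)) for j in range(16)] for i in range(4)]
toy_text = [list(v) for v in toy_audio]

# Supervise the full size and three nested prefixes; 16 -> 2 is an 8x reduction.
nested_dims = (16, 8, 4, 2)
aligned = matryoshka_loss(toy_audio, toy_text, nested_dims)
shuffled = matryoshka_loss(toy_audio, toy_text[::-1], nested_dims)
print(f"aligned pairs: {aligned:.4f}  mismatched pairs: {shuffled:.4f}")
```

At inference time, only `truncate` would be needed: both the audio and text embeddings are cut to the same prefix length and re-normalized before computing similarities. In the actual model the audio embedding would come from the learnable global token rather than from the 1500 frame features, which this sketch does not model.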
