SD · AI · Feb 25

UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

arXiv:2602.21772v1
h-index: 3
Originality: Incremental advance
AI Analysis

UniWhisper addresses the need for robust, multi-domain audio encoders in AI and audio processing. The advance is incremental, since it builds on existing training paradigms.

The paper aims to build a universal audio representation that works across speech, environmental sounds, and music. It proposes UniWhisper, an efficient continual multi-task training framework. On 20 tasks, UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper.

A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
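The kNN evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the cosine-similarity metric, the value of `k`, the majority-vote rule, and the helper name `knn_predict` are all assumptions made for illustration, and the embeddings are synthetic stand-ins for frozen encoder outputs.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=5):
    """Classify test embeddings by majority vote over the k nearest
    training embeddings (cosine similarity). Hypothetical helper."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = test @ train.T                      # shape: (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar
    nn_labels = train_labels[nn_idx]           # shape: (n_test, k)
    n_classes = int(train_labels.max()) + 1
    # Count votes per class for each test item.
    votes = np.zeros((len(test), n_classes), dtype=int)
    for c in range(n_classes):
        votes[:, c] = (nn_labels == c).sum(axis=1)
    return votes.argmax(axis=1)

# Synthetic "frozen embeddings": two well-separated clusters (assumption).
rng = np.random.default_rng(0)
centers = np.zeros((2, 8))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
train_labels = np.repeat([0, 1], 50)
test_labels = np.repeat([0, 1], 10)
train_emb = centers[train_labels] + 0.1 * rng.standard_normal((100, 8))
test_emb = centers[test_labels] + 0.1 * rng.standard_normal((20, 8))

pred = knn_predict(train_emb, train_labels, test_emb, k=5)
accuracy = float((pred == test_labels).mean())
print(f"kNN accuracy on synthetic data: {accuracy:.2f}")
```

Because kNN has no trainable parameters, it measures how well the frozen embedding space already separates classes, whereas an MLP probe can learn a small nonlinear mapping on top of the embeddings.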

