SD · AI · CL · AS · Jun 15, 2024

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

arXiv:2406.10735v1 · 31 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better audio tokenization methods to bridge audio and language processing, though it is incremental as it builds on existing semantic token approaches.

The paper asks which configuration is best for extracting discrete audio tokens from self-supervised learning (SSL) models so that audio details such as content and speaker identity are preserved. It proposes a scalable universal vocoder trained across multiple SSL layers and an attention mechanism that identifies task-specific influential layers, which improved performance on discriminative and generative tasks.

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
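The two ideas named in the abstract, quantizing SSL layer features into semantic tokens and attending over layers, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the nearest-centroid (k-means style) assignment, the per-layer embedding tables, the single `scorer` vector, and all shapes and names below are assumptions chosen for illustration, and random arrays stand in for real SSL features and trained centroids.

```python
import numpy as np


def kmeans_tokenize(features, centroids):
    """Map each frame in (T, D) to the index of its nearest centroid in (K, D)."""
    # Squared Euclidean distance between every frame and every centroid: (T, K)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def layer_attention(token_embs, scorer):
    """Fuse per-layer token embeddings (L, T, E) with softmax weights over layers.

    `scorer` is a hypothetical (E,) vector that would be learned per task.
    Returns the fused embeddings (T, E) and the attention weights (T, L).
    """
    scores = (token_embs @ scorer).T                      # (T, L)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over layers
    fused = np.einsum("tl,lte->te", weights, token_embs)  # weighted sum over layers
    return fused, weights


# Toy setup (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
num_layers, num_frames, feat_dim, codebook_size, emb_dim = 3, 5, 4, 8, 6

features = rng.normal(size=(num_layers, num_frames, feat_dim))      # stand-in SSL features
centroids = rng.normal(size=(num_layers, codebook_size, feat_dim))  # stand-in per-layer codebooks
emb_tables = rng.normal(size=(num_layers, codebook_size, emb_dim))  # stand-in token embeddings
scorer = rng.normal(size=emb_dim)

# One token sequence per layer: (L, T)
tokens = np.stack(
    [kmeans_tokenize(features[l], centroids[l]) for l in range(num_layers)]
)
# Look up an embedding for each token in each layer: (L, T, E)
token_embs = np.stack([emb_tables[l][tokens[l]] for l in range(num_layers)])

fused, weights = layer_attention(token_embs, scorer)
```

In this sketch, averaging `weights` over frames gives one number per layer, which is one simple way to read off which layers a task relies on most.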

