CL · Oct 14, 2024

JOOCI: a Framework for Learning Comprehensive Speech Representations

arXiv:2410.11086v3
h-index: 44
Originality: Incremental advance
AI Analysis

This paper addresses inefficient speech representation learning for tasks that require both linguistic and paralinguistic features. It is an incremental advance over existing methods.

The paper tackles the sub-optimal layer-wise division in self-supervised speech models by proposing JOOCI, a method that jointly optimizes Content and Other information. JOOCI achieves a 26.5% improvement over WavLM on two speaker recognition and two language tasks from the SUPERB benchmark.

Information in speech can be categorized into two groups: Content (what is being said, such as linguistics) and Other (how it is expressed, such as speaker information and paralinguistic features). Current self-supervised learning (SSL) methods are shown to divide the model's representational depth, or layers, in two, with earlier layers specializing in Other-related tasks and later layers in Content-related tasks. This layer-wise division is inherently sub-optimal, as neither information type can use all layers to build hierarchical representations. To address this, we propose JOOCI, a novel speech representation learning method that does not compromise on representational depth for either information type. JOOCI outperforms WavLM by 26.5%, as well as other models of similar size (100M parameters), when evaluated on two speaker recognition and two language tasks from the SUPERB benchmark, demonstrating its effectiveness in Jointly Optimizing Other and Content Information (JOOCI).
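The depth argument in the abstract can be made concrete with a small accounting sketch. This is only an illustration under stated assumptions: the 12-layer stack, the split point, and the "parallel stacks" arrangement are hypothetical, and the abstract does not describe how JOOCI is actually built. The names `DepthBudget`, `split_stack`, and `parallel_stacks` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class DepthBudget:
    """Number of layers that specialize in each information type."""
    content_layers: int
    other_layers: int


def split_stack(total_layers: int, split_at: int) -> DepthBudget:
    """Single stack with a layer-wise division.

    Layers [0, split_at) specialize in Other and layers
    [split_at, total_layers) specialize in Content, mirroring the
    early/late division the abstract attributes to current SSL models.
    """
    if not 0 <= split_at <= total_layers:
        raise ValueError("split_at must lie between 0 and total_layers")
    return DepthBudget(
        content_layers=total_layers - split_at,
        other_layers=split_at,
    )


def parallel_stacks(content_depth: int, other_depth: int) -> DepthBudget:
    """Hypothetical alternative: each information type gets its own stack.

    This is an assumption made for illustration, not the paper's
    documented design.
    """
    return DepthBudget(content_layers=content_depth, other_layers=other_depth)


# Hypothetical numbers: a 12-layer stack divided in the middle.
shared = split_stack(total_layers=12, split_at=6)
separate = parallel_stacks(content_depth=12, other_depth=12)

print(shared)
print(separate)
```

In the single-stack case, any layer given to one information type is taken from the other, so neither can specialize across the full depth. The second arrangement removes that trade-off, at the cost of deciding how to spend a fixed parameter budget across two stacks.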
