AS · IR · May 4

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

arXiv:2605.02804 · 18.8
AI Analysis

For speech retrieval tasks, this framework allows fine-grained control over similarity computation by weighting different attributes, addressing the problem of conflated representations in single-vector embeddings.

Speech encodes multiple simultaneous attributes that conventional single-vector embeddings conflate. The proposed factor-partitioned embedding framework maps each utterance into a single vector with subspaces for distinct axes, enabling attribute-conditioned retrieval that suppresses same-speaker bias and surfaces semantically matched utterances across recording conditions.

Speech encodes multiple simultaneous attributes (linguistic content, speaker identity, dialect, gender) that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how, or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
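
The signed weighted sum over per-axis cosine scores described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The four axis names come from the abstract, but the subspace sizes, the `AXES` layout, the function names, and the example weights are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical layout: each axis owns a contiguous slice of one embedding vector.
# The slice sizes are illustrative assumptions, not values from the paper.
AXES = {
    "content": slice(0, 4),
    "speaker": slice(4, 7),
    "dialect": slice(7, 9),
    "gender": slice(9, 11),
}


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two 1-D arrays, guarded against zero norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def axis_scores(u, v, axes=AXES):
    """Cosine score computed separately inside each axis subspace."""
    return {name: cosine(u[s], v[s]) for name, s in axes.items()}


def weighted_similarity(u, v, weights, axes=AXES):
    """Signed weighted sum of per-axis cosine scores.

    Axes missing from `weights` contribute nothing; a negative weight
    penalises candidates that match the query on that axis.
    """
    scores = axis_scores(u, v, axes)
    return sum(w * scores[name] for name, w in weights.items())


def rank(query, candidates, weights, axes=AXES):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [
        (name, weighted_similarity(query, vec, weights, axes))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def make_embedding(content, speaker, dialect, gender):
    """Concatenate per-axis vectors into one partitioned embedding."""
    return np.concatenate([content, speaker, dialect, gender]).astype(float)


# Toy query plus three candidates with hand-picked per-axis vectors.
query = make_embedding([1, 0, 0, 0], [1, 0, 0], [1, 0], [1, 0])
candidates = {
    "same_text_same_speaker": make_embedding([1, 0, 0, 0], [1, 0, 0], [1, 0], [1, 0]),
    "same_text_other_speaker": make_embedding([1, 0, 0, 0], [0, 1, 0], [0, 1], [0, 1]),
    "other_text_same_speaker": make_embedding([0, 1, 0, 0], [1, 0, 0], [1, 0], [1, 0]),
}

# Reward matching content, penalise matching speaker (assumed weights).
ranking = rank(query, candidates, {"content": 1.0, "speaker": -0.5})
```

With these toy vectors, the negative speaker weight places the utterance with matching content from a different speaker ahead of the one from the same speaker, which is the same-speaker suppression the abstract describes. Setting the speaker weight to zero would leave the two content-matched candidates tied.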

