AS | CL | SD | Jun 22, 2023

Implicit spoken language diarization

arXiv:2306.12913v1 | 1 citation | h-index: 30
Originality: Synthesis-oriented
AI Analysis

This work addresses language diarization for speech processing, but it is incremental as it adapts existing speaker diarization frameworks to a related task.

The paper tackles spoken language diarization by exploring implicit modeling with deep embeddings instead of explicit phonotactic methods. It reports a diarization error rate (DER) of 6.78% and a Jaccard error rate (JER) of 7.06% on synthetic data, and a DER of 22.50% and a JER of 60.38% on practical data, with pre-trained wav2vec embeddings providing a 30.74% relative improvement in JER.

Spoken language diarization (LD) and related tasks have mostly been explored using phonotactic approaches. These approaches model language explicitly and therefore require intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may enable implicit modeling of language information through deep embedding vectors. This work therefore first explores whether the available speaker diarization frameworks, which capture speaker information implicitly, can be used to perform LD. With the end-to-end x-vector approach, the LD system achieves a diarization error rate (DER) of 6.78% and a Jaccard error rate (JER) of 7.06% on synthetic code-switched data, and a DER of 22.50% and a JER of 60.38% on practical data. The degradation on practical data is attributed to data imbalance and is resolved to some extent by using pre-trained wav2vec embeddings, which provide a relative improvement of 30.74% in JER.
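To make the reported metrics concrete, the following is a minimal sketch of a frame-level Jaccard error rate and of the arithmetic behind a relative improvement. It is not the paper's scoring pipeline: the function names, the toy label sequences, and the simplification that reference and hypothesis share the same language labels (so no label mapping or scoring collar is needed) are all assumptions made for illustration. The abstract also does not state which baseline the 30.74% figure refers to; the last function call assumes it is the 60.38% JER on practical data.

```python
def frame_jer(ref, hyp):
    """Frame-level Jaccard error rate, in percent.

    Assumes ref and hyp are equal-length sequences of per-frame language
    labels drawn from the same label set (illustrative simplification).
    For each reference language, the error is 1 - |ref ∩ hyp| / |ref ∪ hyp|,
    i.e. (missed + false-alarm frames) / union; errors are then averaged.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    errors = []
    for label in sorted(set(ref)):
        ref_frames = {i for i, r in enumerate(ref) if r == label}
        hyp_frames = {i for i, h in enumerate(hyp) if h == label}
        union = ref_frames | hyp_frames
        inter = ref_frames & hyp_frames
        errors.append(1.0 - len(inter) / len(union))
    return 100.0 * sum(errors) / len(errors)


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


def apply_relative_improvement(baseline, rel_pct):
    """Error rate that results from a given relative improvement."""
    return baseline * (1.0 - rel_pct / 100.0)


# Toy example: 10 frames, one language boundary placed one frame too early.
ref = ["hi"] * 6 + ["en"] * 4
hyp = ["hi"] * 5 + ["en"] * 5
toy_jer = frame_jer(ref, hyp)

# ASSUMPTION: the 30.74% relative improvement applies to the 60.38% JER.
assumed_jer_after = apply_relative_improvement(60.38, 30.74)
```

Because JER averages the per-language errors, a minority language with few frames weighs as much as the majority language, which is one reason data imbalance can hurt JER far more than DER.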

