CL, SD, AS | Nov 14, 2023

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

arXiv:2311.08323v235 citations | h-index: 3
Originality: Highly original
AI Analysis

This work addresses the challenge of open-vocabulary keyword spotting and forced alignment for any language, offering a novel approach with broad applicability in multilingual speech technology.

The researchers tackled the problem of crosslinguistic generalizability in speech processing by developing phoneme-based models using a multilingual corpus, achieving strong performance on 95 unseen languages and enabling zero-shot forced alignment.

In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated IPAPACK, a massively multilingual speech corpus with phonemic transcriptions, encompassing more than 115 languages from diverse language families and selectively checked by linguists. Based on IPAPACK, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages and showed strong generalizability across them. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zero-shot forced alignment in unseen languages. We further introduce a neural forced aligner, IPA-ALIGNER, obtained by fine-tuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.
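The abstract describes a contrastive embedding model that matches speech signals against phonemic sequences. As a rough illustration of that general idea only, the sketch below computes a symmetric contrastive loss over paired speech and phoneme embeddings and picks the best-matching phoneme sequence by cosine similarity. It is not the CLAP-IPA implementation: the function names, the temperature value of 0.07, and the toy 2-dimensional vectors are all assumptions made for this example, and a real model would produce the embeddings with trained speech and phoneme encoders.

```python
import math

def l2_normalize(vec):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(speech_embs, phone_embs):
    # Entry [i][j] is the cosine similarity between speech i and phoneme sequence j.
    s = [l2_normalize(v) for v in speech_embs]
    p = [l2_normalize(v) for v in phone_embs]
    return [[sum(a * b for a, b in zip(si, pj)) for pj in p] for si in s]

def diagonal_cross_entropy(logits):
    # Mean of -log softmax(row i)[i]: the matching pair sits on the diagonal.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(speech_embs, phone_embs, temperature=0.07):
    # Assumed temperature; averages the speech-to-phoneme and phoneme-to-speech losses.
    sim = similarity_matrix(speech_embs, phone_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

def best_match(speech_emb, phone_embs):
    # Open-vocabulary lookup: index of the most similar phoneme-sequence embedding.
    sims = similarity_matrix([speech_emb], phone_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings standing in for encoder outputs (hypothetical values).
speech = [[1.0, 0.0], [0.0, 1.0]]
phones = [[2.0, 0.0], [0.0, 3.0]]
aligned_loss = symmetric_contrastive_loss(speech, phones)
swapped_loss = symmetric_contrastive_loss(speech, phones[::-1])
match_index = best_match([0.9, 0.1], phones)
```

In this toy setup the correctly paired batch yields a much lower loss than the batch with the phoneme embeddings swapped, which is the signal that pulls matching speech and phoneme representations together during training.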

Code Implementations: 1 repo