AS AI CL | Oct 30, 2024

Phonology-Guided Speech-to-Speech Translation for African Languages

arXiv:2410.23323v3 | h-index: 1 | Speech Communication

Originality: Highly original

AI Analysis

This work addresses speech-to-speech translation for low-resource African languages, offering a scalable, non-autoregressive approach with incremental improvements in alignment and model performance.

The paper tackles speech-to-speech translation for African languages by leveraging cross-linguistic pause synchrony. It shows that within-phylum language pairs have 30-40% lower pause variance and over 3x higher onset/offset correlation than cross-phylum pairs. It introduces SPaDA, a dynamic-programming alignment algorithm that improves alignment F1 by 3-4 points and removes up to 38% of spurious matches relative to greedy VAD baselines, and SegUniDiff, a diffusion-based model that matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) from 12.5% to 5.3%, and runs at a real-time factor (RTF) of 1.02.

We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30-40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by +3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) from 12.5% to 5.3%, and runs at an RTF of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1-M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.
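
To make the idea of a dynamic-programming alignment over silence, rate, and semantic cues concrete, here is a minimal sketch in Python. It is not the paper's SPaDA implementation: the segment layout (pause before the segment, duration, embedding), the exponential silence term, the duration-ratio rate term, the equal weights, and the skip penalty are all illustrative assumptions, and the names `pair_score` and `align` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_score(s, t, w_sil=1.0, w_rate=1.0, w_sem=1.0):
    """Score one source/target segment pair.

    Each segment is (pause_before_sec, duration_sec, embedding), with
    duration_sec > 0. All three terms are assumed forms, not the paper's.
    """
    silence = math.exp(-abs(s[0] - t[0]))      # similar preceding pauses -> near 1
    rate = min(s[1], t[1]) / max(s[1], t[1])   # similar durations -> near 1
    semantic = cosine(s[2], t[2])              # similar embeddings -> near 1
    return w_sil * silence + w_rate * rate + w_sem * semantic


def align(src, tgt, skip_penalty=-0.5):
    """Monotonic alignment that maximizes the summed pair scores.

    Returns a list of (src_index, tgt_index) matches in order. Segments on
    either side may be skipped at a cost of skip_penalty each (an assumption).
    """
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = best[i - 1][0] + skip_penalty, "up"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = best[0][j - 1] + skip_penalty, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(src[i - 1], tgt[j - 1]), "match"),
                (best[i - 1][j] + skip_penalty, "up"),    # skip a source segment
                (best[i][j - 1] + skip_penalty, "left"),  # skip a target segment
            ]
            best[i][j], back[i][j] = max(options)
    # Walk back from the final cell to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Unlike a greedy matcher, which commits to each locally best pair in turn, the table-based search can skip a segment that has no good counterpart when doing so raises the total score. In practice the embeddings would come from a speech encoder and the pause and duration values from a voice activity detector; here they are plain numbers so that the sketch stays self-contained.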

