LatPhon: Lightweight Multilingual G2P for Romance Languages and English
LatPhon provides a compact multilingual G2P front-end for speech processing tasks such as TTS and ASR across Romance languages and English. The contribution is incremental, since it builds on existing Transformer and G2P approaches.
The paper addresses grapheme-to-phoneme conversion for multiple Latin-script languages with LatPhon, a lightweight Transformer model with 7.5M parameters. The model achieves a mean phoneme error rate of 3.5% on the ipa-dict corpus, outperforming the ByT5 baseline and approaching language-specific methods while using only 30 MB of memory.
Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages: English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-language speech pipelines.
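
To make the two headline numbers concrete, the following is a minimal sketch of how a phoneme error rate is commonly computed and how a 7.5M-parameter model could come to about 30 MB. It is not the authors' evaluation code. The function names are hypothetical, the PER here is assumed to be Levenshtein edit distance summed over all words and divided by the total number of reference phonemes (the abstract does not say how errors are aggregated or how the mean is taken), and the footprint estimate assumes 4-byte float32 weights with 1 MB taken as 10^6 bytes.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j phonemes of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Total edit errors divided by total reference phonemes (assumed definition)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of entries")
    total_phonemes = sum(len(r) for r in refs)
    if total_phonemes == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phonemes


def float32_footprint_mb(num_params: int) -> float:
    """Weight storage in MB, assuming 4 bytes per parameter and 1 MB = 10**6 bytes."""
    return num_params * 4 / 1e6


if __name__ == "__main__":
    # Toy data for illustration only; these are not entries from ipa-dict.
    refs = [["k", "æ", "t"], ["d", "ɔ", "g"]]
    hyps = [["k", "a", "t"], ["d", "ɔ", "g"]]
    print(f"PER: {phoneme_error_rate(refs, hyps):.3f}")
    print(f"Footprint: {float32_footprint_mb(7_500_000):.1f} MB")
```

Under these assumptions, 7.5M parameters at 4 bytes each give 30 MB, which matches the memory figure quoted in the abstract. A different weight precision or a per-language averaged PER would change the numbers accordingly.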