Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition
This work improves multilingual speech recognition through better phoneme-to-grapheme conversion; the contribution is incremental, since it builds on existing LLM-based P2G methods.
The paper tackles multilingual phoneme-to-grapheme conversion for speech recognition, addressing language-aware generation and cross-language data imbalance, and reduces the average word error rate from 10.56% to 7.66% on a ten-language benchmark (2.90 points absolute, about 27.5% relative).
Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging because it requires language-aware generation and faces severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
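The abstract does not say how low-resource languages are oversampled. As a rough illustration only, the sketch below assumes one common scheme, exponent-smoothed sampling, in which each language is drawn with probability proportional to its utterance count raised to a power `alpha`. The function name `oversampling_weights`, the parameter `alpha`, and the utterance counts are hypothetical and are not taken from the paper or from CV-Lang10.

```python
import random
from collections import Counter


def oversampling_weights(counts, alpha=0.5):
    """Return per-language sampling probabilities proportional to n ** alpha.

    alpha = 1.0 keeps the natural (imbalanced) distribution;
    alpha = 0.0 samples every language equally often.
    This is an assumed scheme, not necessarily the one used in the paper.
    """
    scaled = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


# Hypothetical utterance counts; these are NOT the CV-Lang10 statistics.
counts = {"en": 1_000_000, "es": 250_000, "ky": 10_000}

natural = oversampling_weights(counts, alpha=1.0)
smoothed = oversampling_weights(counts, alpha=0.5)

# Pick the language of each training example using the smoothed weights.
rng = random.Random(0)
langs = list(smoothed)
draws = rng.choices(langs, weights=[smoothed[l] for l in langs], k=10_000)
drawn = Counter(draws)

for lang in langs:
    print(f"{lang}: natural={natural[lang]:.4f} smoothed={smoothed[lang]:.4f}")
```

With these made-up counts, the smallest language rises from under 1% of draws to roughly 6%, while the largest drops from about 79% to about 62%. Lower values of `alpha` flatten the distribution further, at the cost of repeating low-resource examples more often.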