Jointly Learning to Align and Convert Graphemes to Phonemes with Neural Attention Models
This work addresses grapheme-to-phoneme conversion, a key problem in speech and language processing for applications such as text-to-speech. The contribution is incremental, since it builds on existing attention mechanisms.
The authors tackle grapheme-to-phoneme conversion with an attention-enabled encoder-decoder model that jointly learns alignments and conversions. Their best models achieve state-of-the-art results on three standard datasets (CMUDict, Pronlex, and NetTalk).
We propose an attention-enabled encoder-decoder model for the problem of grapheme-to-phoneme conversion. Most previous work has tackled the problem via joint sequence models that require explicit alignments for training. In contrast, the attention-enabled encoder-decoder model allows for jointly learning to align and convert characters to phonemes. We explore different types of attention models, including global and local attention, and our best models achieve state-of-the-art results on three standard data sets (CMUDict, Pronlex, and NetTalk).
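To make the distinction between global and local attention concrete, the following is a minimal sketch in plain Python, not the model described above. It assumes dot-product scoring, a fixed window around a given center position for the local variant, and made-up two-dimensional encoder states; the names `global_attention`, `local_attention`, `center`, and `half_width` are hypothetical and chosen only for illustration. In both cases the attention weights act as a soft alignment between the phoneme being generated and the input graphemes.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Context vector: attention-weighted average of encoder states."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def global_attention(decoder_state, encoder_states):
    """Attend over every grapheme position (dot-product scoring is an assumption)."""
    weights = softmax([dot(h, decoder_state) for h in encoder_states])
    return weights, weighted_sum(weights, encoder_states)


def local_attention(decoder_state, encoder_states, center, half_width=1):
    """Attend only to positions within half_width of center; others get weight 0."""
    lo = max(0, center - half_width)
    hi = min(len(encoder_states), center + half_width + 1)
    window = softmax([dot(h, decoder_state) for h in encoder_states[lo:hi]])
    weights = [0.0] * len(encoder_states)
    weights[lo:hi] = window
    return weights, weighted_sum(weights, encoder_states)


# Toy encoder states for the five graphemes of "phone" (illustrative numbers only).
encoder_states = [[0.1, 0.3], [0.2, -0.1], [0.9, 0.4], [0.0, 0.5], [-0.3, 0.2]]
decoder_state = [1.0, 0.5]

g_weights, g_context = global_attention(decoder_state, encoder_states)
l_weights, l_context = local_attention(decoder_state, encoder_states, center=2)
```

In this sketch, global attention spreads nonzero weight over all five graphemes, while local attention with `center=2` and `half_width=1` restricts the alignment to positions 1 through 3. How the center is chosen (for example, predicted by the decoder or advanced monotonically) is left open here.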