Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes
This work addresses the problem of efficient multilingual speech processing for languages with large vocabularies, offering a scalable alternative to traditional units like characters or sub-words.
The authors tackled the challenge of scaling multilingual speech recognition and synthesis by modeling text as sequences of Unicode bytes, avoiding large softmaxes and enabling shared representations across languages. They demonstrated that byte-based models outperform character-based ones in monolingual speech recognition, that a multilingual byte model improves on the single-language baselines by 4.4% relative on average, and that it improves on the monolingual baseline by 38.6% relative on Japanese-English code-switching speech, while matching monolingual performance in synthesis.
We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing. In this work, we model text via a sequence of Unicode bytes, specifically, the UTF-8 variable-length byte sequence for each character. Bytes allow us to avoid large softmaxes in languages with large vocabularies, and to share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in monolingual end-to-end speech recognition. Additionally, our multilingual byte model outperforms each respective single-language baseline by 4.4% relative on average. In Japanese-English code-switching speech, our multilingual byte model outperforms our monolingual baseline by 38.6% relative. Finally, we present an end-to-end multilingual speech synthesis model using byte representations which matches the performance of our monolingual baselines.
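To make the byte representation concrete, the following is a minimal sketch of mapping text to a UTF-8 byte sequence and back. It illustrates only the text unit, not the A2B or B2A models themselves; the function names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and do not come from the paper.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as its UTF-8 byte values (hypothetical helper).

    Every ID lies in 0..255, so the output vocabulary has at most 256
    entries regardless of how many characters the language has.
    """
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Decode byte values back to text (hypothetical helper).

    A model may emit byte sequences that are not valid UTF-8; using
    errors="replace" here is an assumption of this sketch, chosen so
    that decoding never raises.
    """
    return bytes(ids).decode("utf-8", errors="replace")


# Mixed English/Japanese input: ASCII characters take one byte each,
# while each of these kanji takes three bytes.
mixed = "hi 日本"
ids = text_to_byte_ids(mixed)
print(ids)
print(byte_ids_to_text(ids))
```

In this sketch the five-character string becomes nine byte IDs, which shows the trade-off the abstract implies: the softmax stays small and the symbols are shared across languages, at the cost of longer output sequences for scripts whose characters need multiple bytes.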