Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
This work addresses the usability gap for low-resource languages in audio processing, though the contribution is incremental because it builds on existing multilingual foundations.
The paper addresses the lack of cross-lingual ability in audio language models for low-resource languages such as Thai. The resulting model, Typhoon-Audio, outperforms existing open-source models and performs comparably to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
Audio language models process audio inputs using textual prompts for tasks like speech recognition and audio captioning. Although built on multilingual pre-trained components, most are trained primarily on English, limiting their usability for other languages. This paper evaluates audio language models on Thai, a low-resource language, and finds that they lack emergent cross-lingual abilities despite their multilingual foundations. To address this, we explore data mixtures that optimize audio language models for both a target language and English while integrating audio comprehension and speech instruction-following into a unified model. Our experiments provide insights into improving instruction-following in low-resource languages by balancing language-specific and multilingual training data. The proposed model, Typhoon-Audio, significantly outperforms existing open-source models and achieves performance comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai.