CL AI
Nov 12, 2023

AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

arXiv:2311.06753v2
16.872 citations
h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses the need for LLMs with broad speech abilities in conversational AI and multimodal applications, though it builds incrementally on existing instruction-tuned models.

The authors extend large language models (LLMs) to general-purpose speech processing without curated paired data. The resulting model, AudioChatLlama, performs on par with or outperforms cascaded systems (speech recognizer + LLM) on speech QA tasks and can interchange text and audio within a conversation.

In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results.
