CL AI SD AS Apr 25, 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao

arXiv:2304.12995v1 27.9399 citations h-index: 83 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of enabling AI systems to process audio and conduct spoken conversations with users, a topic in AI and human-computer interaction. It appears incremental because it builds on existing LLMs and foundation models.

To address the inability of large language models to process complex audio, the authors propose AudioGPT, a multi-modal system that combines foundation models with input/output interfaces (ASR, TTS) to handle speech, music, sound, and talking head tasks, enabling rich audio content creation in multi-round dialogues.

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.
