PodAgent: A Comprehensive Framework for Podcast Generation
PodAgent addresses the challenge of creating engaging, expressive podcasts for content creators, though the contribution appears incremental because it builds on existing audio generation and LLM methods.
The paper tackles the generation of podcast-like audio programs. It proposes PodAgent, a framework that combines multi-agent collaboration for content generation, voice-role matching, and LLM-enhanced speech synthesis. The reported results are an 87.4% voice-matching accuracy and topic-discussion dialogue content that surpasses direct GPT-4 generation.
Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in generating in-depth content and producing appropriate, expressive voices. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
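To make the voice-role matching step more concrete, here is a minimal sketch of how a role could be paired with a voice from a pool. It is not PodAgent's implementation: the abstract does not say how matching is scored, so the `Voice` and `Role` classes, the descriptor tags, the Jaccard overlap score, and the greedy assignment are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    name: str
    tags: frozenset  # hypothetical descriptors, e.g. {"female", "warm"}


@dataclass(frozen=True)
class Role:
    name: str
    wanted: frozenset  # hypothetical descriptors the role calls for


def jaccard(a, b):
    """Overlap of two tag sets, in [0, 1]. An assumed scoring rule."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_voices(roles, pool):
    """Greedily give each role the best-scoring voice not yet assigned."""
    available = list(pool)
    assignment = {}
    for role in roles:
        if not available:
            raise ValueError("voice pool is smaller than the number of roles")
        best = max(available, key=lambda v: jaccard(role.wanted, v.tags))
        assignment[role.name] = best.name
        available.remove(best)  # keep speakers distinguishable
    return assignment


# Illustrative data only; none of these voices or roles come from the paper.
pool = [
    Voice("voice_a", frozenset({"male", "deep", "calm"})),
    Voice("voice_b", frozenset({"female", "warm", "energetic"})),
    Voice("voice_c", frozenset({"female", "calm", "formal"})),
]
roles = [
    Role("host", frozenset({"female", "warm", "energetic"})),
    Role("guest", frozenset({"male", "calm"})),
]
result = match_voices(roles, pool)
print(result)
```

Removing each voice from the pool once it is assigned is a design choice made here so that no two speakers share a voice. A real system would more likely compare learned voice and persona representations than hand-written tags.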