PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
PodBench provides a reproducible testbed for researchers working on long-form, audio-centric generation tasks. Its contribution is incremental, however, because it focuses on benchmarking rather than on developing new methods.
The authors address the lack of systematic evaluation resources for podcast script generation by introducing PodBench, a benchmark of 800 samples with complex instructions. They find that open-source models with explicit reasoning are more robust than standard baselines on long contexts and multi-speaker coordination, but that strong instruction following does not always correlate with content quality.
Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
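To make the "quantitative constraints" half of the evaluation framework concrete, here is a minimal Python sketch of a rule-based check on a generated script. It is not PodBench's actual implementation. The `Speaker: text` turn format, the `Constraints` fields, and the `check_script` function are all hypothetical and chosen only for illustration; the abstract does not specify which constraints the benchmark measures.

```python
import re
from dataclasses import dataclass


@dataclass
class Constraints:
    """Hypothetical instruction constraints (not taken from the paper)."""
    num_speakers: int
    min_words: int
    max_words: int


# Assumed turn format: "Speaker name: utterance text" on a single line.
TURN_RE = re.compile(r"^\s*([^:\n]{1,40}):\s*(.+)$")


def check_script(script: str, c: Constraints) -> dict:
    """Count speakers and spoken words, then compare against the constraints."""
    speakers = set()
    words = 0
    for line in script.splitlines():
        m = TURN_RE.match(line)
        if not m:
            continue  # skip blank lines and anything that is not a turn
        speakers.add(m.group(1).strip())
        words += len(m.group(2).split())
    return {
        "speaker_count_ok": len(speakers) == c.num_speakers,
        "length_ok": c.min_words <= words <= c.max_words,
        "speakers": sorted(speakers),
        "words": words,
    }


if __name__ == "__main__":
    demo = "Host: Welcome to the show.\nGuest: Thanks for having me.\n"
    print(check_script(demo, Constraints(num_speakers=2, min_words=5, max_words=20)))
```

Checks of this kind measure only whether the instructions were followed. A script can pass all of them and still be thin on substance, which is the divergence the abstract reports and the reason the framework pairs such checks with LLM-based quality assessment.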