TTMBA: Towards Text To Multiple Sources Binaural Audio Generation
The work addresses the need for spatial audio in immersive experiences such as VR, but its contribution is incremental because it builds on existing pretrained models.
The paper tackles the problem of generating immersive binaural audio from text. Existing text-to-audio methods mostly produce mono output and neglect spatial information. The proposed cascaded method segments the text with an LLM, generates mono audio for each sound event, and renders binaural output with temporal and spatial control; the authors report better audio generation quality and spatial perceptual accuracy.
Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
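The four stages of the cascade can be illustrated with a small sketch. This is not the authors' implementation. The `SoundEvent` fields, the sample rate, and the azimuth convention are assumptions made for illustration. The pretrained mono generation network is replaced by a sine-tone stub, and the binaural rendering neural network is replaced by simple constant-power panning with a crude interaural delay. Only the final step, placing each binaural clip at its start time and summing, follows the abstract directly.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # assumed for this sketch, not taken from the paper


@dataclass
class SoundEvent:
    """Hypothetical structured record the LLM might emit for one sound event."""
    description: str
    start_s: float      # when the event begins in the final mix
    duration_s: float   # how long the event lasts
    azimuth_deg: float  # assumed convention: -90 = left, 0 = front, +90 = right


def generate_mono(event, freq_hz=440.0):
    """Stub for the pretrained mono generator: returns a sine tone."""
    n = int(round(event.duration_s * SAMPLE_RATE))
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def render_binaural(mono, azimuth_deg):
    """Stub for the rendering network: level panning plus an interaural delay."""
    az = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    pan = (az + math.pi / 2) / 2  # 0 (hard left) .. pi/2 (hard right)
    gain_l, gain_r = math.cos(pan), math.sin(pan)
    # Crude interaural time difference, up to about 0.66 ms at the side.
    delay = int(round(abs(math.sin(az)) * 0.00066 * SAMPLE_RATE))
    pad = [0.0] * delay
    left = [gain_l * s for s in mono]
    right = [gain_r * s for s in mono]
    if az > 0:  # source on the right: the left ear hears it later
        left, right = pad + left, right + pad
    else:       # source on the left or in front: the right ear hears it later
        left, right = left + pad, pad + right
    return left, right


def arrange(events):
    """Place each binaural clip at its start time and sum into one stereo mix."""
    rendered = []
    for e in events:
        left, right = render_binaural(generate_mono(e), e.azimuth_deg)
        rendered.append((int(round(e.start_s * SAMPLE_RATE)), left, right))
    total = max(offset + len(left) for offset, left, _ in rendered)
    out_l, out_r = [0.0] * total, [0.0] * total
    for offset, left, right in rendered:
        for i, (sl, sr) in enumerate(zip(left, right)):
            out_l[offset + i] += sl
            out_r[offset + i] += sr
    return out_l, out_r


# Example: what a segmented prompt such as "a dog barks on the left,
# then a car passes on the right" might look like after the LLM step.
events = [
    SoundEvent("dog barking", start_s=0.0, duration_s=1.0, azimuth_deg=-90.0),
    SoundEvent("car passing", start_s=0.5, duration_s=1.0, azimuth_deg=90.0),
]
left, right = arrange(events)
```

In this toy mix the first half second contains only the left-hand event, so nearly all of its energy sits in the left channel, and the final stretch contains only the right-hand event. A learned renderer would also model spectral and distance cues that this panning stub ignores.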