Zihao Chen, Zhentao Lin, Bi Zeng et al.
To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the standard solution. However, existing methods offer limited coverage of real-world scenes and depend on pre-existing noise libraries and scene metadata. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel approach driven by generative language models that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). The DGSI module, with a BET (Background, Examples, Task) prompt framework, dynamically generates logic-compliant scene-based information, including scene dimensions, sound sources, and microphone positions, thereby addressing the challenges of scene enumeration and detailed description. Complementing this, the SNAS module employs a Time-Frequency Diffusion-based (TFD) Text-to-Audio model to synthesize scene-specific noise. By integrating this noise with clean speech via Room Impulse Response (RIR) filters, the module streamlines the traditionally labor-intensive process of replicating diverse acoustic environments. Experimental results show that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models, achieving relative improvements of up to 11.32%. Furthermore, DGSNA is highly compatible with existing noise addition techniques. Our implementation and demonstrations are available at https://dgsna.github.io.
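To illustrate the final mixing step, the following is a minimal sketch of combining clean speech with scene noise through RIR filters. It is not the DGSNA implementation: the function name `add_scene_noise`, the use of a separate RIR per source, the target signal-to-noise ratio `snr_db`, and the tiling of short noise clips are all assumptions made for illustration. The abstract specifies only that noise and clean speech are integrated via RIR filters.

```python
import numpy as np


def add_scene_noise(speech, noise, speech_rir, noise_rir, snr_db):
    """Mix reverberant speech with reverberant noise at a target SNR in dB.

    Hypothetical helper: all parameter names are illustrative assumptions.
    """
    n = len(speech)

    # Repeat the noise clip if it is shorter than the speech, then trim.
    reps = int(np.ceil(n / len(noise)))
    noise = np.tile(noise, reps)[:n]

    # Apply each room impulse response by linear convolution,
    # trimmed back to the speech length.
    rev_speech = np.convolve(speech, speech_rir)[:n]
    rev_noise = np.convolve(noise, noise_rir)[:n]

    speech_power = np.mean(rev_speech ** 2)
    noise_power = np.mean(rev_noise ** 2)
    if noise_power == 0.0:
        # Silent noise: nothing to scale or add.
        return rev_speech

    # Scale the noise so that speech_power / scaled_noise_power
    # equals 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return rev_speech + gain * rev_noise
```

In this sketch the speech RIR and noise RIR would correspond to different source positions relative to the same microphone, which is one plausible use of the scene dimensions, sound sources, and microphone positions generated by the DGSI module.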