Low-Resource Guidance for Controllable Latent Audio Diffusion
This work provides a computationally efficient method for fine-grained control of latent audio diffusion models, which is beneficial for researchers and practitioners working on generative audio applications.
This paper addresses the high computational cost of guidance-based control in generative audio models by introducing Latent-Control Heads (LatCHs). LatCHs operate in latent space, avoiding expensive decoder backpropagation, and achieve effective control over audio intensity, pitch, and beats with minimal training resources (7M parameters, ~4 hours of training) while maintaining generation quality.
Generative audio applications require fine-grained control over outputs, yet most existing methods either retrain the model for specific controls or rely on inference-time control (\textit{e.g.}, guidance), which can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to backpropagation through the decoder, we introduce a guidance-based approach built on selective Training-Free Guidance (TFG) and Latent-Control Heads (LatCHs), which enables control of latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoid the expensive decoder step, and require minimal training resources (7M parameters and $\approx$ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances control precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
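To make the latent-space idea concrete, the following is a minimal sketch of guidance computed from a control head that reads the latent directly, so no gradient ever flows through a decoder. It is an illustration under stated assumptions, not the paper's implementation: the head here is a hypothetical linear map (the actual LatCHs are learned networks of about 7M parameters), the target is a made-up per-frame intensity curve, and the names `head_predict`, `guidance_grad`, and `guide_latent` are invented for this example. In a real sampler, a step like this would be applied inside the diffusion loop rather than to a fixed latent.

```python
import numpy as np

def head_predict(W, b, z):
    """Hypothetical linear control head: latent (C, T) -> one control value per frame (T,)."""
    return W @ z + b

def control_loss(W, b, z, target):
    """Squared error between the head's prediction and the target control curve."""
    residual = head_predict(W, b, z) - target
    return 0.5 * float(residual @ residual)

def guidance_grad(W, b, z, target):
    """Analytic gradient of control_loss w.r.t. the latent z.

    Because the head acts on the latent itself, this gradient needs no
    decoder forward or backward pass.
    """
    residual = head_predict(W, b, z) - target  # shape (T,)
    return np.outer(W, residual)               # shape (C, T)

def guide_latent(W, b, z, target, step_size=0.1, n_steps=20):
    """Nudge the latent toward the target control with plain gradient steps."""
    for _ in range(n_steps):
        z = z - step_size * guidance_grad(W, b, z, target)
    return z

# Toy setup (all sizes and values are assumptions for illustration).
rng = np.random.default_rng(0)
C, T = 64, 32                          # latent channels, latent frames
W = rng.normal(size=C) / np.sqrt(C)    # stand-in for trained head weights
b = 0.0
z = rng.normal(size=(C, T))            # stand-in for a latent during sampling
target = np.linspace(0.0, 1.0, T)      # e.g. a rising intensity curve

loss_before = control_loss(W, b, z, target)
z_guided = guide_latent(W, b, z, target)
loss_after = control_loss(W, b, z_guided, target)
print(f"control loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The contrast with end-to-end guidance is in `guidance_grad`: a decoder-based control would have to decode the latent to audio, extract the feature, and backpropagate through the decoder at every step, whereas here the cost per step is that of the small head alone.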