Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching
This work addresses the challenge of high-quality spatial audio capture for consumer immersive communication and XR, offering a practical solution that generalizes from synthetic to real-world data.
Flow-HOA is a generative framework that uses conditional flow matching to jointly optimize time-domain, spectral, and spatial fidelity for Ambisonics encoding from sparse microphone arrays. It improves signal fidelity and spatial accuracy over model-based baselines on simulated data, and subjective listening tests on real recordings confirm higher sound quality.
Higher-Order Ambisonics (HOA) encoding from sparse, irregular microphone arrays remains a critical challenge for consumer spatial audio capture in immersive communication and XR. We propose Flow-HOA, a generative framework that jointly optimizes a multi-dimensional objective encompassing time-domain, spectral, and spatial fidelity while producing a deployable, time-invariant bank of Finite Impulse Response (FIR) encoding filters. Using conditional flow matching, the model learns to map a simple prior distribution to the target distribution of FIR filter coefficients. Training is guided by a composite loss that balances time-domain waveform fidelity, multi-resolution spectral consistency, sub-band energy preservation, and spatial directivity constraints. Objective evaluations on simulated data demonstrate improved performance over strong model-based baselines in both signal fidelity and spatial accuracy metrics. Subjective listening tests on real microphone array recordings further confirm that Flow-HOA yields higher overall sound quality with reduced artifacts, demonstrating generalization from synthetic training data to real-world capture conditions.
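
To make the two central ideas of the abstract concrete, the following minimal sketch illustrates (a) how a conditional flow matching regression target can be formed for a vector of filter coefficients and (b) how a time-invariant FIR bank maps microphone signals to encoded channels. It is an illustration under stated assumptions, not the Flow-HOA implementation: the abstract does not specify the probability path or the prior, so a linear interpolation path and a standard Gaussian prior are assumed here, and the names `cfm_training_pair` and `apply_fir_bank`, the tap values, and the toy signals are all hypothetical.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * x1.

    Returns the point on the path and the velocity x1 - x0 that a
    network would be trained to regress at (x_t, t).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def apply_fir_bank(mics, filters):
    """Encode M microphone signals into K output channels.

    mics:    list of M signals, each a list of N samples
    filters: K x M nested list; filters[k][m] holds the FIR taps that
             map microphone m to output channel k
    Returns K signals of N samples (causal convolution, truncated to N).
    """
    n = len(mics[0])
    out = []
    for row in filters:
        y = [0.0] * n
        for x, h in zip(mics, row):
            for i in range(n):
                for j, tap in enumerate(h):
                    if i - j < 0:
                        break  # no samples before the start of the signal
                    y[i] += tap * x[i - j]
        out.append(y)
    return out


# (a) Flow matching target for a toy coefficient vector.
rng = random.Random(0)
x1 = [0.5, -0.25, 0.125]                # hypothetical target filter taps
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # assumed Gaussian prior sample
x_t, v = cfm_training_pair(x0, x1, t=0.3)

# (b) Applying a fixed 2-channel, 2-microphone FIR bank.
mics = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
filters = [
    [[1.0], [1.0]],             # channel 0: sum of both microphones
    [[1.0, 0.0], [0.0, -1.0]],  # channel 1: mic 0 minus mic 1 delayed by one sample
]
encoded = apply_fir_bank(mics, filters)
```

Because the generated filters are fixed once sampled, encoding at run time reduces to the convolutions in `apply_fir_bank`, with no generative model in the signal path. This is what makes a time-invariant filter bank straightforward to deploy. In practice the nested loops would be replaced by an FFT-based or vectorized convolution.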