AS · SD · May 7

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

arXiv:2605.05554 · 27.4 · h-index: 8
Predicted impact top 92% in AS · last 90 days
Originality: Incremental advance
AI Analysis

Provides a more accurate and diagnostic metric for evaluating audio generative models, addressing known limitations of FAD.

OTAD improves audio generation evaluation by replacing FAD's Gaussian coupling with Sinkhorn OT and adding a learned Riemannian ground metric, achieving higher sensitivity to artifacts (1.9-3.6×) and better correlation with human MOS, plus per-sample diagnostics (AUROC ≥0.86).

In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints on both of its primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism: a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $\varepsilon = 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.
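The contrast between the two couplings can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a plain squared-Euclidean cost on raw embeddings (the learned Riemannian ground-metric adapter is not modelled), uniform sample weights, and a cost matrix rescaled to [0, 1] so that the regularisation strength is comparable across encoders. The function names (`frechet_distance`, `pairwise_sq_cost`, `sinkhorn_plan`, `per_sample_cost`) and the synthetic data are hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """Squared 2-Wasserstein distance between Gaussian fits of x and y (FAD-style)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = ((mu_x - mu_y) ** 2).sum()
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def pairwise_sq_cost(x, y):
    """Squared Euclidean cost matrix, rescaled to [0, 1] (an assumption of this sketch)."""
    diff = x[:, None, :] - y[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    return cost / cost.max()


def _logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_plan(cost, eps=0.05, n_iter=500):
    """Entropic OT plan between uniform marginals, computed in the log domain."""
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        # Alternate dual updates so that rows, then columns, match the marginals.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)


def per_sample_cost(plan, cost):
    """Average transport cost paid per unit mass by each generated sample (column)."""
    return (plan * cost).sum(axis=0) / plan.sum(axis=0)


# Synthetic stand-in for embeddings: the first 10 generated samples are contaminated.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 8))
generated = rng.normal(size=(200, 8))
generated[:10] += 6.0

cost = pairwise_sq_cost(reference, generated)
plan = sinkhorn_plan(cost, eps=0.05)
scores = per_sample_cost(plan, cost)

print("Gaussian (Frechet) distance:", frechet_distance(reference, generated))
print("Sinkhorn transport cost:", float((plan * cost).sum()))
print("Samples with highest per-sample cost:", np.argsort(scores)[-10:])
```

The Fréchet distance collapses each set into a mean and covariance, so it returns a single scalar. The discrete plan keeps one column per generated sample, which is what makes a per-sample score like `per_sample_cost` possible. How OTAD itself derives its diagnostics from the plan is not specified in the abstract.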
