SD, LG, AS | Oct 22, 2024

Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement

arXiv:2410.16785v2
h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck for music producers and researchers by enabling more diverse and controllable audio synthesis without costly annotations, though it builds incrementally on existing synthesis techniques.

The paper tackles a limitation of MIDI-to-audio synthesis: existing methods need MIDI annotations for training, which restricts the diversity of timbres and expression styles. It proposes CoSaRef, a method that combines concatenative synthesis with generative refinement and requires no MIDI-audio paired datasets. In both objective and subjective evaluations, CoSaRef outperformed the state-of-the-art timbre-controllable method trained with MIDI supervision.

Recent MIDI-to-audio synthesis methods using deep neural networks have successfully generated high-quality, expressive instrumental tracks. However, these methods require MIDI annotations for supervised training, limiting the diversity of instrument timbres and expression styles in the output. We propose CoSaRef, a MIDI-to-audio synthesis method that does not require MIDI-audio paired datasets. CoSaRef first generates a synthetic audio track using concatenative synthesis based on MIDI input, then refines it with a diffusion-based deep generative model trained on datasets without MIDI annotations. This approach improves the diversity of timbres and expression styles. Additionally, it allows detailed control over timbres and expression through audio sample selection and extra MIDI design, similar to traditional functions in digital audio workstations. Experiments showed that CoSaRef could generate realistic tracks while preserving fine-grained timbre control via one-shot samples. Moreover, despite not being supervised on MIDI annotation, CoSaRef outperformed the state-of-the-art timbre-controllable method based on MIDI supervision in both objective and subjective evaluation.
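To make the first stage concrete, the following is a minimal sketch of concatenative synthesis from MIDI-like note events. It is not the authors' implementation. The function names, the decaying-sine stand-in for a recorded one-shot sample, the 8 kHz sample rate, and pitch shifting by linear-interpolation resampling are all assumptions chosen for illustration. The diffusion-based refinement stage is not shown.

```python
import math

def make_one_shot(sr=8000, freq=440.0, dur=0.5):
    """Hypothetical stand-in for a recorded one-shot sample: a decaying sine at A4."""
    n = int(sr * dur)
    return [math.sin(2 * math.pi * freq * i / sr) * math.exp(-6.0 * i / n)
            for i in range(n)]

def resample(sample, ratio):
    """Pitch-shift by changing playback rate (linear interpolation).
    ratio > 1 raises the pitch and shortens the sample."""
    out = []
    for i in range(int(len(sample) / ratio)):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = sample[j + 1] if j + 1 < len(sample) else 0.0
        out.append((1.0 - frac) * sample[j] + frac * nxt)
    return out

def concatenative_render(notes, one_shot, base_pitch=69, sr=8000):
    """Place a pitch-shifted, velocity-scaled copy of the one-shot at each note.
    notes: list of (onset_sec, midi_pitch, velocity, duration_sec)."""
    total = max(onset + dur for onset, _, _, dur in notes)
    track = [0.0] * (int(total * sr) + 1)
    for onset, pitch, velocity, dur in notes:
        # Equal-temperament ratio between the target pitch and the sample's pitch.
        ratio = 2.0 ** ((pitch - base_pitch) / 12.0)
        start = int(onset * sr)
        grain = resample(one_shot, ratio)[: int(dur * sr)]
        grain = grain[: len(track) - start]  # guard against rounding overrun
        gain = velocity / 127.0
        for k, x in enumerate(grain):
            track[start + k] += gain * x  # overlap-add into the output track
    return track

# Two notes: A4 at full velocity, then A5 at about half velocity.
notes = [(0.0, 69, 127, 0.25), (0.25, 81, 64, 0.25)]
track = concatenative_render(notes, make_one_shot())
```

In this sketch, choosing a different one-shot changes the timbre and editing the note list changes the expression, which mirrors the control the abstract attributes to audio sample selection and MIDI design. The resulting track would then be the input to the refinement model.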

