Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
This work addresses realistic drum audio generation from symbolic representations, a problem relevant to both music production and research. The contribution is incremental, however, because it applies existing codec-token prediction to a new domain.
The authors propose a Transformer-based system that converts expressive drum grids (MIDI with microtiming and velocity) into drum audio by predicting discrete codes from neural audio codecs (EnCodec, DAC, X-Codec). Trained and evaluated on the E-GMD dataset with objective metrics for fidelity and musical alignment, the method establishes codec-token prediction as a viable approach for percussive audio generation.
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs (EnCodec, DAC, and X-Codec) to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset (E-GMD), a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
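To make the input representation concrete, the following is a minimal sketch of how an expressive drum grid could be built from timed drum hits. The layout (three step-by-instrument matrices for hits, velocities, and microtiming offsets) is an assumption modeled on common groove representations, not a detail confirmed by the abstract, and the function name `build_drum_grid` and its parameters are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def build_drum_grid(
    events: List[Tuple[float, int, int]],
    tempo_bpm: float = 120.0,
    steps_per_beat: int = 4,
    num_steps: int = 16,
    num_instruments: int = 3,
) -> Tuple[Matrix, Matrix, Matrix]:
    """Quantize (onset_sec, instrument, midi_velocity) events onto a grid.

    Returns three num_steps x num_instruments matrices:
      hits       - 1.0 where a drum is struck, else 0.0
      velocities - MIDI velocity scaled to [0, 1]
      offsets    - microtiming as a fraction of a step, in [-0.5, 0.5)
    The matrix layout is an illustrative assumption, not the paper's spec.
    """
    step_sec = 60.0 / tempo_bpm / steps_per_beat
    hits = [[0.0] * num_instruments for _ in range(num_steps)]
    velocities = [[0.0] * num_instruments for _ in range(num_steps)]
    offsets = [[0.0] * num_instruments for _ in range(num_steps)]

    for onset_sec, instrument, midi_velocity in events:
        position = onset_sec / step_sec           # onset measured in grid steps
        step = math.floor(position + 0.5)         # nearest grid step
        if not 0 <= step < num_steps:
            continue                              # ignore hits outside the grid
        velocity = midi_velocity / 127.0
        # If two hits land on the same cell, keep the louder one.
        if velocity > velocities[step][instrument]:
            hits[step][instrument] = 1.0
            velocities[step][instrument] = velocity
            offsets[step][instrument] = position - step

    return hits, velocities, offsets


# Example: kick on the downbeat, a slightly late snare, and a full-velocity hi-hat.
example_events = [(0.0, 0, 100), (0.13, 1, 64), (0.5, 2, 127)]
hits, velocities, offsets = build_drum_grid(example_events)
```

In a pipeline like the one described, a grid of this kind would be the Transformer's input, and the model's output tokens would be passed to the pre-trained codec decoder. Those two stages are omitted here because their exact interfaces depend on the chosen codec.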