SD · AI · May 15

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

arXiv:2605.15831 · 12.41 citations
Predicted impact top 43% in SD · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the challenge of autoregressive music generation by designing a tokenizer that simplifies language modeling, benefiting researchers in audio generation.

BandTok proposes a 2D Mel-spectrogram tokenizer for music generation that uses a single shared codebook for frequency-band tokens, avoiding error accumulation from residual multi-codebook quantization. It achieves strong generation results in data-limited settings.

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
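
To make the token layout described above concrete, here is a minimal, illustrative sketch of two ideas from the abstract: mapping each Mel frame to a row of band tokens drawn from one shared codebook, and a 2D rotary embedding whose attention score depends only on relative time and band offsets. This is not the authors' implementation. BandTok uses a learned encoder with EMA codebook updates, whereas the sketch quantizes raw band values by nearest neighbour. The function names, the toy codebook, the band size, and the choice to give half of the rotary dimensions to each axis are all assumptions made for illustration.

```python
import math

def quantize_frame(frame, codebook, band_size):
    """Split one Mel frame into bands; map each band to its nearest code index."""
    tokens = []
    for start in range(0, len(frame), band_size):
        band = frame[start:start + band_size]
        # Squared Euclidean distance to every entry of the single shared codebook.
        dists = [sum((x - c) ** 2 for x, c in zip(band, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def tokenize(mel, codebook, band_size):
    """Return a (frames x bands) grid of token ids, one row per Mel frame."""
    return [quantize_frame(frame, codebook, band_size) for frame in mel]

def rope_2d(vec, t, b, base=10000.0):
    """Rotate the first half of vec by time index t and the second half by band index b.

    Assumes len(vec) is divisible by 4 (an illustrative split, not taken from the paper).
    """
    half = len(vec) // 2
    out = []
    for pos, chunk in ((t, vec[:half]), (b, vec[half:])):
        n_pairs = len(chunk) // 2
        for i in range(n_pairs):
            theta = pos * base ** (-i / n_pairs)
            x, y = chunk[2 * i], chunk[2 * i + 1]
            out.append(x * math.cos(theta) - y * math.sin(theta))
            out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy input: 2 frames, 4 Mel bins each, split into 2 bands of 2 bins.
mel = [[0.0, 0.1, 1.0, 0.9],
       [1.1, 1.0, 0.0, -0.1]]
codebook = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical shared codebook with 2 entries
grid = tokenize(mel, codebook, band_size=2)
print(grid)  # [[0, 1], [1, 0]]

# The rotated query/key score is unchanged when both positions shift together.
q = [0.3, -1.2, 0.8, 0.5, 1.0, 0.2, -0.7, 0.4]
k = [1.1, 0.4, -0.6, 0.9, 0.1, -0.3, 0.5, 1.5]
score_a = dot(rope_2d(q, 5, 2), rope_2d(k, 3, 1))
score_b = dot(rope_2d(q, 9, 6), rope_2d(k, 7, 5))
```

Every cell of the grid indexes the same codebook, so a token's meaning does not depend on a residual level, and the (time, band) coordinates passed to rope_2d remain available even after the grid is flattened into a sequence for autoregressive modeling.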

