Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation
This work addresses the data scarcity problem in text-to-music generation for researchers with limited resources, offering a method to improve performance without large proprietary datasets.
The authors propose score-aware training for text-to-music generation, which uses audio-caption alignment scores as a supervision signal. Low-quality data is repurposed through a CLAP-conditioned noise schedule, and the most misaligned segments are filtered out. The system placed 2nd in the objective evaluation and 3rd in the Efficiency Track MOS evaluation of the ICME 2026 ATTM Grand Challenge.
State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose \textit{score-aware training}, which treats the audio-caption alignment score as a direct supervision signal throughout the pipeline. Rather than discarding low-scoring segments, we repurpose them via a CLAP-conditioned Beta noise timestep schedule that routes them to high-noise training regimes, where they act as an effective implicit regularizer. In addition, segment-level filtering removes only the most misaligned examples, and a two-stage caption procedure bridges the distribution gap between verbose training captions and concise inference prompts. A REPA auxiliary loss further transfers structured semantic knowledge from pretrained CLAP and MuQ encoders without additional data. Our 450M-parameter FluxAudio-based system, submitted to the ICME 2026 ATTM Grand Challenge Efficiency Track, ranked 2nd across both tracks in the objective evaluation and 3rd in the Efficiency Track in the final MOS evaluation.
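To make the score-conditioned timestep idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes alignment scores have been normalized to [0, 1], that t = 1 denotes pure noise, and that the Beta parameters depend on the score as shown. The function names, the `strength` parameter, and the filtering threshold are hypothetical and chosen only for illustration.

```python
import numpy as np


def filter_segments(scores, threshold=0.2):
    """Return indices of segments whose alignment score is at least `threshold`.

    The threshold value is a hypothetical choice for this sketch.
    """
    s = np.asarray(scores, dtype=np.float64)
    return np.flatnonzero(s >= threshold)


def sample_timesteps(scores, strength=4.0, rng=None):
    """Draw one diffusion timestep in [0, 1] per example from a Beta distribution.

    Assumed parameterization (not taken from the paper):
        alpha = 1 + strength * (1 - score),  beta = 1
    A score of 1 gives Beta(1, 1), i.e. uniform timesteps, while lower
    scores shift probability mass toward t = 1 (high noise).
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.clip(np.asarray(scores, dtype=np.float64), 0.0, 1.0)
    alpha = 1.0 + strength * (1.0 - s)
    beta = np.ones_like(s)
    return rng.beta(alpha, beta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = np.array([0.05, 0.3, 0.6, 0.95])
    kept = filter_segments(scores)        # drops only the worst segment
    t = sample_timesteps(scores[kept], rng=rng)
    print(kept, t)
```

Under these assumptions, a poorly aligned segment mostly contributes at noise levels where the caption has little influence on the denoising target, which is one way to read the regularization effect described above.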