SDAIAS
Jul 15, 2025

EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

arXiv:2507.11096v1
h-index: 2
Has Code
Originality: Incremental advance
AI Analysis

This work addresses audio editing for music generation, offering a novel approach that improves controllability and realism, though it builds on existing prompt-to-prompt and diffusion techniques.

The study tackles efficient audio editing in auto-regressive models through cross-attention control. In both automatic and human evaluations, the resulting method significantly outperforms a diffusion-based baseline on the melody, dynamics, and tempo of the generated audio.

In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with auto-regressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
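
To make the three editing mechanisms named in the abstract more concrete, the following is a minimal NumPy sketch of Prompt-to-Prompt-style edits on a toy cross-attention map of shape (audio frames, text tokens). It is not taken from the EditGen repository: the function names, the `swap_until` cutoff, the `mapping` dictionary, and the row renormalization are all assumptions made for illustration, and the paper's actual implementation on MUSICGEN may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def replace_attention(attn_source, attn_target, step, swap_until):
    """Replacement (hypothetical helper): inject the source prompt's attention
    map for the first `swap_until` generation steps, then fall back to the
    edited prompt's own map. Both maps must have the same shape."""
    return attn_source.copy() if step < swap_until else attn_target.copy()


def reweight_attention(attn, token_index, scale):
    """Reweighting (hypothetical helper): scale the attention paid to one text
    token. Renormalizing each row is a choice of this sketch so that rows
    remain probability distributions."""
    out = attn.copy()
    out[:, token_index] *= scale
    return out / out.sum(axis=-1, keepdims=True)


def refine_attention(attn_source, attn_target, mapping):
    """Refinement (hypothetical helper): tokens shared by both prompts reuse
    the source attention column; tokens that are new in the edited prompt keep
    the target attention. `mapping` maps target index -> source index."""
    out = attn_target.copy()
    for tgt_idx, src_idx in mapping.items():
        out[:, tgt_idx] = attn_source[:, src_idx]
    return out / out.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = softmax(rng.normal(size=(4, 3)))  # 4 frames, 3 prompt tokens
    target = softmax(rng.normal(size=(4, 5)))  # edited prompt has 2 extra tokens

    boosted = reweight_attention(source, token_index=1, scale=2.0)
    refined = refine_attention(source, target, mapping={0: 0, 1: 1, 3: 2})
    print(boosted.shape, refined.shape)
```

In this sketch, Replacement assumes the two prompts have the same number of tokens, while Refinement handles an edited prompt that adds tokens by copying attention only for the tokens the prompts share.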

