LG · May 12

Efficient Adjoint Matching for Fine-tuning Diffusion Models

Jeongwoo Shin, Dongsoo Shin, Joonseok Lee, Jaewoong Choi, Jaemoo Choi

arXiv:2605.11480

Predicted impact: top 15% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners fine-tuning text-to-image diffusion models, EAM provides a computationally cheaper method to align with human preferences without sacrificing performance.

Efficient Adjoint Matching (EAM) reformulates reward fine-tuning of diffusion models as a stochastic optimal control problem with a linear base drift, eliminating the need for both full trajectory simulation and backward simulation of the adjoint ODE. EAM converges up to 4x faster than Adjoint Matching (AM) while matching or surpassing it on metrics such as PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics, including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.
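
To illustrate why a linear base drift can make backward adjoint simulation unnecessary, here is a minimal sketch on a scalar toy problem. It is not the paper's implementation. It assumes a generic adjoint ODE of the form da/dt = -alpha(t) * a with terminal condition a(1), which is what a drift b(x, t) = alpha(t) * x would give. The coefficient `alpha` and all function names are hypothetical choices made for this example.

```python
def alpha(t):
    # Hypothetical drift coefficient (an assumption for this sketch):
    # the base drift is b(x, t) = alpha(t) * x.
    return -1.0 / (1.0 + t)


def adjoint_by_backward_simulation(a_terminal, n_steps=1000):
    """Integrate da/dt = -alpha(t) * a from t = 1 back to t = 0.

    Uses explicit Euler steps in reverse time. This stands in for the
    per-trajectory backward solve that a non-trivial drift would require.
    """
    dt = 1.0 / n_steps
    a = a_terminal
    for k in range(n_steps, 0, -1):
        t = k * dt
        # a(t - dt) ~= a(t) - dt * da/dt = a(t) + dt * alpha(t) * a(t)
        a = a + dt * alpha(t) * a
    return a


def adjoint_closed_form(a_terminal, t=0.0):
    """Closed form a(t) = exp(integral_t^1 alpha(s) ds) * a(1).

    For alpha(s) = -1 / (1 + s) the integral equals log((1 + t) / 2),
    so the exponential reduces to (1 + t) / 2.
    """
    return a_terminal * (1.0 + t) / 2.0


if __name__ == "__main__":
    a1 = 3.0  # stand-in for a terminal-cost gradient
    print(adjoint_by_backward_simulation(a1), adjoint_closed_form(a1))
```

In this toy setting the closed form replaces the 1000-step backward loop with a single multiplication. The abstract reports an analogous closed-form adjoint for EAM but does not state its exact expression, so the formula above should be read only as an analogy.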
