CV | LG | IV | Oct 12, 2024

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

arXiv:2410.09347v1 | 15.814 citations | h-index: 10 | Has Code | ICLR

Originality: Incremental advance

AI Analysis

This work targets AI researchers and practitioners working on autoregressive visual generation. It addresses the sampling inefficiency of guided generation by unifying alignment and guidance methods. The advance is incremental, since it builds on existing techniques.

The paper tackles the design inconsistencies that Classifier-Free Guidance introduces in autoregressive visual generation. It proposes Condition Contrastive Alignment (CCA), which fine-tunes pretrained models for just one epoch so that guidance-free sampling performs on par with guided methods, cutting sampling cost by half.

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
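To make the "cuts the sampling cost by half" claim concrete, here is a minimal sketch of how guided sampling typically combines logits at each AR decoding step. It assumes the common CFG formulation (unconditional logits plus a scaled difference), which may differ in parameterization from the one used in the paper. The function names and toy logits are hypothetical and are not taken from the CCA repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cfg_logits(cond_logits, uncond_logits, scale):
    """Common CFG combination (assumed form): u + scale * (c - u).

    scale = 1.0 returns the conditional logits unchanged;
    scale > 1.0 pushes the distribution toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def forward_passes_per_token(use_guidance):
    """Guided sampling needs a conditional and an unconditional pass per token."""
    return 2 if use_guidance else 1

# Toy next-token logits over a 3-entry vocabulary (illustrative values only).
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]

guided = softmax(cfg_logits(cond, uncond, scale=3.0))
plain = softmax(cond)
```

In this sketch, guided sampling requires two model evaluations per generated token, while a guidance-free model needs one. This is the factor-of-two saving the abstract refers to when a fine-tuned model matches guided quality without the unconditional pass.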

