CV · May 21, 2024

Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

arXiv:2405.13218v2 · 16 citations · h-index: 33
Originality: Incremental advance
AI Analysis

This paper gives practitioners a direct, compute-controlled comparison of popular image synthesis methods. It is an incremental advance because it analyzes existing paradigms rather than proposing a new one.

The paper compares three image synthesis approaches (diffusion, masked-token prediction, and next-token prediction) under controlled compute budgets. It finds that next-token prediction significantly outperforms diffusion on prompt following and is the most efficient in inference compute, while scaling trends suggest diffusion may eventually match it on image quality.

Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput is more important.
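To make the idea of comparing inference compute in FLOPs concrete, here is a rough back-of-the-envelope sketch. It is not the paper's accounting. It assumes the common approximation that one Transformer forward pass costs about 2 × parameters × tokens FLOPs (ignoring attention-score cost), that diffusion and masked-token prediction each run a full forward pass over all image tokens at every sampling step, and that next-token prediction uses a KV cache so each step processes a single new token. All function names and the numbers in the usage note are hypothetical.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Approximate FLOPs for one forward pass: ~2 * params * tokens.

    Assumption: attention-score cost and embeddings are ignored.
    """
    return 2 * n_params * n_tokens


def diffusion_inference_flops(n_params: int, n_tokens: int, n_steps: int) -> int:
    """Assumes every denoising step is a full pass over all image tokens."""
    return n_steps * forward_flops(n_params, n_tokens)


def masked_token_inference_flops(n_params: int, n_tokens: int, n_steps: int) -> int:
    """Assumes every unmasking step is a full pass over all image tokens."""
    return n_steps * forward_flops(n_params, n_tokens)


def next_token_inference_flops(n_params: int, n_tokens: int) -> int:
    """Assumes a KV cache, so each of the n_tokens steps processes one token."""
    return n_tokens * forward_flops(n_params, 1)


if __name__ == "__main__":
    # Hypothetical settings for illustration only, not taken from the paper.
    params, tokens = 1_000_000_000, 256
    ntp = next_token_inference_flops(params, tokens)
    dif = diffusion_inference_flops(params, tokens, n_steps=50)
    mtp = masked_token_inference_flops(params, tokens, n_steps=10)
    print(f"next-token:   {ntp:.3e} FLOPs")
    print(f"masked-token: {mtp:.3e} FLOPs ({mtp // ntp}x next-token)")
    print(f"diffusion:    {dif:.3e} FLOPs ({dif // ntp}x next-token)")
```

Under these assumptions, the multi-step methods cost roughly their step count times the next-token total. FLOPs are not latency, though: next-token prediction needs one sequential step per token, which is consistent with the abstract recommending diffusion when low latency matters.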
