CV | Feb 19, 2024

FiT: Flexible Vision Transformer for Diffusion Model

arXiv:2402.12376v4 | 89 citations | h-index: 17 | Has Code | ICML
Originality: Incremental advance
AI Analysis

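The paper addresses a limitation of diffusion models for image generation: poor handling of resolutions outside the training domain. Removing it allows more flexible outputs without the bias introduced by image cropping. The advance is incremental, since it builds on existing transformer architectures.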

The paper tackles the difficulty diffusion models have with image resolutions outside their trained domain. It introduces the Flexible Vision Transformer (FiT), which generates images at unrestricted resolutions and aspect ratios, and reports strong performance across a broad range of resolutions.

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.
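The idea of treating an image as a sequence of dynamically-sized tokens can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `patchify_to_tokens`, the patch size of 2, the cap of 256 tokens, and the zero-padding-plus-mask scheme are all assumptions chosen for illustration. It shows how images with different aspect ratios can map to token sequences of one fixed padded length, with a mask marking which tokens are real.

```python
import numpy as np


def patchify_to_tokens(image, patch=2, max_tokens=256):
    """Turn an (H, W, C) array into a padded token sequence.

    Hypothetical helper for illustration only; `patch` and `max_tokens`
    are assumed values, not settings taken from the paper.
    Returns (tokens, mask, pos):
      tokens: (max_tokens, patch*patch*C), zero-padded after the real tokens
      mask:   (max_tokens,) bool, True where a token comes from the image
      pos:    (max_tokens, 2) integer (row, col) grid index of each real token
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    n = gh * gw
    if n > max_tokens:
        raise ValueError("image yields more tokens than max_tokens")

    # Split H into (gh, patch) and W into (gw, patch), then group each
    # patch's pixels together: (gh, gw, patch, patch, C) -> (n, patch*patch*C).
    tokens = (
        image.reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(n, patch * patch * c)
    )

    padded = np.zeros((max_tokens, tokens.shape[1]), dtype=image.dtype)
    padded[:n] = tokens

    mask = np.zeros(max_tokens, dtype=bool)
    mask[:n] = True

    # Keep the 2-D grid position of every real token so that a 2-D
    # positional encoding could be applied regardless of aspect ratio.
    ys, xs = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    pos = np.zeros((max_tokens, 2), dtype=np.int64)
    pos[:n] = np.stack([ys.ravel(), xs.ravel()], axis=1)
    return padded, mask, pos


# A wide and a tall input share the same padded sequence length.
wide = np.random.rand(16, 32, 4).astype(np.float32)
tall = np.random.rand(32, 16, 4).astype(np.float32)
for img in (wide, tall):
    tokens, mask, pos = patchify_to_tokens(img)
    print(img.shape, tokens.shape, int(mask.sum()))
```

In a sketch like this, the mask would let attention and the loss ignore padded positions, so no cropping to a fixed square grid is needed.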

Code Implementations: 2 repos