Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
This work addresses efficiency issues in diffusion models for image generation and offers a scalable alternative to Transformers, though the contribution is incremental in that it adapts an existing NLP architecture.
The paper tackles the computational complexity of Transformers in high-resolution image generation by adapting the RWKV architecture to diffusion models, achieving FID and IS scores comparable to or better than those of existing models while reducing FLOPs.
Transformers have catalyzed advancements in computer vision and natural language processing (NLP). However, their substantial computational complexity limits their application in long-context tasks, such as high-resolution image generation. This paper introduces a series of architectures adapted from the RWKV model used in NLP, with requisite modifications tailored for diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Similar to diffusion with Transformers, our model is designed to efficiently handle patchified inputs in a sequence with extra conditions, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage is its reduced spatial aggregation complexity, which makes it well suited to processing high-resolution images and eliminates the need for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV matches or surpasses existing CNN- or Transformer-based diffusion models in FID and IS metrics while significantly reducing total computation FLOPs.
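To make the scaling argument concrete, the following is a minimal sketch of how patchifying turns an image into a token sequence, and how a rough token-mixing cost grows with resolution. It is illustrative only and not taken from the paper: the function names (`num_tokens`, `patchify`, `aggregation_cost`), the patch size of 16, and the cost model (constants dropped, quadratic for pairwise attention, linear for an RWKV-style recurrence) are all assumptions.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image (hypothetical helper)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split a 2D grid (list of rows) into a row-major sequence of flattened patches."""
    height, width = len(image), len(image[0])
    num_tokens(height, width, patch)  # raises if the size is not divisible
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def aggregation_cost(n_tokens, dim):
    """Rough token-mixing cost with constants dropped (assumed model, not measured)."""
    return {
        "quadratic": n_tokens * n_tokens * dim,  # every token attends to every token
        "linear": n_tokens * dim,                # one recurrent pass over the sequence
    }


if __name__ == "__main__":
    for side in (256, 512, 1024):
        n = num_tokens(side, side, patch=16)
        cost = aggregation_cost(n, dim=64)
        print(side, n, cost["quadratic"] // cost["linear"])
```

Under these assumptions the ratio between the two costs equals the token count itself, so doubling the image side quadruples the gap. This is the intuition behind why a linear-complexity aggregator can avoid windowing at high resolution; actual FLOP savings depend on the full architecture and are reported in the paper's experiments.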