Dynamic Mixture-of-Experts for Visual Autoregressive Model
This work addresses the inference cost of VAR models for image generation, though it appears incremental, since it builds on existing VAR and MoE methods.
The paper tackles computational redundancy in Visual Autoregressive Models (VAR) for image generation by introducing a dynamic Mixture-of-Experts router with scale-aware thresholding. The result is 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.
Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture makes it possible to trade compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.
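The abstract does not spell out the routing rule, so the following is only a minimal sketch of one common way to build a threshold-based dynamic router. It assumes cumulative-probability (top-p style) routing: each token activates the smallest set of experts whose router probabilities sum to at least a threshold, and the threshold varies with the generation scale. The names `dynamic_route`, `scale_threshold` and `moe_forward`, the linear schedule, and the defaults `t_coarse=0.9` and `t_fine=0.5` are all hypothetical illustrations, not the paper's method.

```python
import math
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_threshold(scale_idx: int, num_scales: int,
                    t_coarse: float = 0.9, t_fine: float = 0.5) -> float:
    """Hypothetical scale-aware schedule (an assumption, not the paper's rule):
    interpolate linearly from t_coarse at the first (coarsest) scale to t_fine
    at the last (finest) scale, where most tokens, and thus most FLOPs, live."""
    if num_scales == 1:
        return t_coarse
    frac = scale_idx / (num_scales - 1)
    return (1.0 - frac) * t_coarse + frac * t_fine


def dynamic_route(logits: List[float], threshold: float) -> List[int]:
    """Pick the smallest set of experts whose probabilities sum to >= threshold.
    Confident tokens (peaked distribution) get few experts; uncertain tokens
    (flat distribution) get more."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    cum = 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= threshold:
            break
    return chosen


def moe_forward(x: float, experts: List[Callable[[float], float]],
                logits: List[float], threshold: float) -> float:
    """Run only the selected experts and mix their outputs with the router
    probabilities renormalized over the selected set."""
    probs = softmax(logits)
    chosen = dynamic_route(logits, threshold)
    norm = sum(probs[i] for i in chosen)
    return sum(probs[i] / norm * experts[i](x) for i in chosen)


if __name__ == "__main__":
    num_scales = 10
    for logits in ([4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]):
        for s in (0, num_scales - 1):
            t = scale_threshold(s, num_scales)
            print(logits, "scale", s, "threshold", round(t, 2),
                  "experts", dynamic_route(logits, t))
```

In this sketch, compute is saved because unselected experts are never evaluated. Since the threshold is applied only at inference time on top of an already trained router, it can be tuned without retraining, which fits the abstract's "without requiring additional training". The paper's actual criterion for token complexity and resolution may differ.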