CV · Nov 26, 2024

LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang

Tsinghua

arXiv:2411.17178v1 · 16.815 citations · h-index: 26 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of deploying VAR models on resource-constrained devices, representing an incremental improvement through compression techniques.

The paper tackles the high computational cost of Visual Autoregressive (VAR) models for image generation by proposing efficient attention and low-bit quantization methods. It reports an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction, with an FID increase of less than 0.056.

Visual Autoregressive (VAR) modelling has emerged as a promising approach in image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conducted an analysis and identified significant redundancy in three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we proposed an efficient attention mechanism and a low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we achieved an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction. To ensure deployment feasibility, we developed efficient training-free compression techniques and analyzed the deployment feasibility and efficiency gain of each technique.
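The abstract names reduced data precision as one source of savings but does not spell out the quantization scheme. As a rough illustration of the general idea only, the sketch below applies generic symmetric uniform quantization with a single shared scale. The function names, the 8-bit width, and the sample values are all assumptions made for this example, not details taken from the paper.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 when num_bits is 8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    # Round to the nearest integer level, then clamp to the signed range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer levels."""
    return [q * scale for q in quantized]


# Hypothetical weight values, chosen only to exercise the functions.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric(weights, num_bits=8)
restored = dequantize(q, scale)

# Round-to-nearest keeps each value within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

In this generic scheme, each stored value occupies `num_bits` bits plus one shared scale, so halving the bit width roughly halves the storage for those values. Whether the reported 50% memory reduction arises this way is not stated in the text above.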

