CV | AI | LG | Nov 10, 2022

Efficient Image Generation with Variadic Attention Heads

Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi

Georgia Tech

arXiv:2211.05770v3 | 14.924 citations | h-index: 62 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the trade-off between computational cost and image coherence in vision transformers for image generation, offering an incremental improvement in efficiency and performance.

The paper tackles the computational inefficiency of transformers in vision models by introducing variadic attention heads that attend to multiple receptive fields, achieving a 6% improvement in FID (2.05 on FFHQ) with 28% fewer parameters and 4x throughput compared to StyleGAN-XL.

While the integration of transformers into vision models has yielded significant improvements on vision tasks, these models still require significant amounts of computation for both training and inference. Restricted attention mechanisms significantly reduce these computational burdens but come at the cost of losing either global or local coherence. We propose a simple yet powerful method to reduce these trade-offs: allow the attention heads of a single transformer to attend to multiple receptive fields. We demonstrate our method utilizing Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we are able to achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL, while utilizing 28% fewer parameters and with 4× the throughput capacity. StyleNAT achieves the Pareto Frontier on FFHQ-256 and demonstrates powerful and efficient image generation on other datasets. Our code and model checkpoints are publicly available at: https://github.com/SHI-Labs/StyleNAT
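The core idea in the abstract, letting different heads of one attention layer use different receptive fields, can be illustrated with a minimal sketch. This is not the StyleNAT implementation: it assumes a 1D token sequence instead of a 2D feature map, drops out-of-range neighbors at the borders (a simplification of how Neighborhood Attention treats edges), and uses hypothetical names (`neighbor_indices`, `attend`, `variadic_attention`, `head_configs`) chosen only for illustration.

```python
import math


def neighbor_indices(i, n, kernel, dilation):
    """Positions that query i may attend to: `kernel` taps spaced `dilation` apart.

    Simplifying assumption: taps falling outside [0, n) are dropped.
    """
    half = kernel // 2
    taps = [i + dilation * offset for offset in range(-half, half + 1)]
    return [j for j in taps if 0 <= j < n]


def attend(q, k, v, kernel, dilation):
    """Single-head local attention over a 1D sequence; q, k, v are n x d lists."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        nbrs = neighbor_indices(i, n, kernel, dilation)
        # Scaled dot-product scores against the neighbors only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, nbrs)) / total
            for c in range(d)
        ])
    return out


def variadic_attention(q, k, v, head_configs):
    """Each head h gets its own (kernel, dilation) pair from head_configs[h]."""
    return [
        attend(q[h], k[h], v[h], kernel, dilation)
        for h, (kernel, dilation) in enumerate(head_configs)
    ]


# Two heads over 8 tokens: head 0 is dense and local, head 1 is dilated and sparser.
n, d = 8, 2
q = [[[0.0] * d for _ in range(n)] for _ in range(2)]
k = [[[0.0] * d for _ in range(n)] for _ in range(2)]
v = [[[float(i), 0.0] for i in range(n)] for _ in range(2)]
out = variadic_attention(q, k, v, head_configs=[(3, 1), (3, 2)])
```

With all-zero queries the weights are uniform, so each output is the mean of the values in that head's neighborhood. At position 0, head 0 averages tokens 0 and 1 while head 1 averages tokens 0 and 2, which shows the two heads covering different spans at the same per-head cost.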

