CV | AI | LG | Nov 10, 2022

Efficient Image Generation with Variadic Attention Heads

Georgia Tech
arXiv:2211.05770v3 | 24 citations | h-index: 81 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the trade-off between computational cost and image coherence in vision transformers for image generation, offering an incremental improvement in efficiency and performance.

The paper tackles the computational inefficiency of transformers in vision models by introducing variadic attention heads that attend to multiple receptive fields. On FFHQ it reaches an FID of 2.05, a 6% improvement over StyleGAN-XL, with 28% fewer parameters and 4x the throughput.

While the integration of transformers in vision models has yielded significant improvements on vision tasks, these models still require significant amounts of computation for both training and inference. Restricted attention mechanisms significantly reduce these computational burdens but come at the cost of losing either global or local coherence. We propose a simple yet powerful method to reduce these trade-offs: allow the attention heads of a single transformer to attend to multiple receptive fields. We demonstrate our method using Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL, while using 28% fewer parameters and with 4$\times$ the throughput capacity. StyleNAT achieves the Pareto frontier on FFHQ-256 and demonstrates powerful and efficient image generation on other datasets. Our code and model checkpoints are publicly available at: https://github.com/SHI-Labs/StyleNAT
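To make the core idea concrete, here is a minimal sketch of attention heads that each attend to a different receptive field. It is not the StyleNAT implementation: it works on a 1D sequence rather than a 2D feature map, uses plain Python lists, and simply drops out-of-range neighbors at the borders, which is a simplifying assumption. The names `neighborhood_indices`, `head_attention`, `multi_field_attention`, and the `(kernel_size, dilation)` configuration format are hypothetical and chosen only for illustration.

```python
import math


def neighborhood_indices(i, length, kernel_size, dilation):
    """Key positions that query position i may attend to (hypothetical helper)."""
    half = kernel_size // 2
    candidates = (i + offset * dilation for offset in range(-half, half + 1))
    # Simplifying assumption: neighbors that fall outside the sequence are dropped.
    return [j for j in candidates if 0 <= j < length]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(q, k, v, kernel_size, dilation):
    """One attention head restricted to a local (optionally dilated) neighborhood."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        idx = neighborhood_indices(i, n, kernel_size, dilation)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, idx))
                    for c in range(len(v[0]))])
    return out


def multi_field_attention(q_heads, k_heads, v_heads, head_configs):
    """Give each head its own (kernel_size, dilation), then concatenate head outputs."""
    per_head = [head_attention(q, k, v, ks, d)
                for q, k, v, (ks, d) in zip(q_heads, k_heads, v_heads, head_configs)]
    n = len(per_head[0])
    return [[x for head in per_head for x in head[i]] for i in range(n)]


# Toy usage: two heads over the same 8-token sequence.
seq = [[float(i), 1.0] for i in range(8)]
configs = [(3, 1), (3, 3)]  # head 0: dense local window; head 1: dilated, wider reach
out = multi_field_attention([seq, seq], [seq, seq], [seq, seq], configs)
```

In this sketch both heads look at three keys per query, so their cost is the same, but the dilated head spans seven positions instead of three. Mixing such heads in one layer is the kind of local/global combination the abstract describes.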

Code Implementations: 2 repos