CV · AI · Sep 5, 2024

LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

arXiv:2409.03516v1 · 7 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

LMLT addresses efficiency and accuracy problems in image super-resolution, representing an incremental improvement over existing ViT-based methods.

The paper tackles the high complexity and window boundary issues in Vision Transformer-based image super-resolution by proposing LMLT, which uses attention with varying feature sizes across heads, reducing inference time and GPU memory usage while matching or exceeding state-of-the-art performance.

Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.
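
The low-to-high flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only mirrors the three steps named above: split channels into heads, shrink the spatial size more for lower heads, and pass each head's attention result up to the next head. Several choices here are assumptions made for illustration: the function names, average pooling by powers of two, nearest-neighbour upsampling, additive fusion between heads, and plain softmax attention over all positions with identity query/key/value projections in place of learned projections and windowed attention.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a (C, H, W) array by an integer factor k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    """Nearest-neighbour upsample a (C, H, W) array by an integer factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def self_attention(x):
    """Softmax self-attention over all spatial positions.

    Assumption: identity Q/K/V projections and no windowing, for brevity.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                 # (N, C), one token per pixel
    scores = tokens @ tokens.T / np.sqrt(c)        # (N, N) scaled similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return (weights @ tokens).T.reshape(c, h, w)

def multi_level_attention(x, num_heads=4):
    """Hypothetical low-to-high multi-level attention over a (C, H, W) feature.

    Requires C divisible by num_heads and H, W divisible by 2**(num_heads - 1).
    """
    chunks = np.split(x, num_heads, axis=0)        # divide along channels
    outputs, carry = [], None
    for i, chunk in enumerate(chunks):
        factor = 2 ** (num_heads - 1 - i)          # lowest head is smallest
        feat = avg_pool(chunk, factor)
        if carry is not None:
            # Integrate the lower head's result into the next higher head.
            feat = feat + upsample(carry, 2)
        carry = self_attention(feat)
        outputs.append(upsample(carry, factor))    # back to full resolution
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    features = np.random.default_rng(0).normal(size=(8, 16, 16))
    print(multi_level_attention(features, num_heads=4).shape)
```

In this sketch the lowest head attends over a 2x2 map that summarizes the whole 16x16 feature, so its output carries global context, and each higher head receives that context before attending at a finer resolution. The attention cost also shrinks quickly for lower heads, since the number of tokens drops by a factor of four with each level.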

Code Implementations: 1 repo