CV · Jun 25, 2021

PVT v2: Improved Baselines with Pyramid Vision Transformer

arXiv:2106.13797v7 · 2427 citations · Has Code
Originality: Incremental advance
AI Analysis

This work provides improved baselines for vision Transformer research, addressing efficiency and performance bottlenecks in computer vision tasks, though it is incremental over PVT v1.

The paper tackles the computational complexity of the Pyramid Vision Transformer (PVT v1) by introducing three modifications: a linear-complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network. The resulting PVT v2 reduces that complexity to linear and achieves significant improvements on classification, detection, and segmentation, with performance comparable to or better than Swin Transformer.
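To make the "linear complexity" claim concrete, the following minimal sketch counts query-key pairs for one attention layer. It assumes that the linear-complexity layer pools keys and values to a fixed grid before attention. The function name `attention_pairs` and the pooled size of 7 are illustrative choices, not details taken from the summary above.

```python
def attention_pairs(height, width, pooled_size=None):
    """Count query-key pairs for one attention layer on a height x width token grid.

    pooled_size=None -> full self-attention: every token attends to every token.
    pooled_size=P    -> keys/values are first pooled to a fixed P x P grid
                        (an assumed stand-in for a linear-complexity layer).
    """
    num_queries = height * width
    num_keys = num_queries if pooled_size is None else pooled_size * pooled_size
    return num_queries * num_keys


if __name__ == "__main__":
    # Doubling the side length quadruples the token count each step.
    for side in (56, 112, 224):
        full = attention_pairs(side, side)
        pooled = attention_pairs(side, side, pooled_size=7)
        print(f"{side}x{side} tokens: full={full:,} pooled={pooled:,}")
```

Under these assumptions, quadrupling the number of tokens multiplies the full-attention count by 16 but the pooled count only by 4, which is what linear growth in the number of tokens means here.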

Transformer recently has presented encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs, including (1) linear complexity attention layer, (2) overlapping patch embedding, and (3) convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performances than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer researches in computer vision. Code is available at https://github.com/whai362/PVT.
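As a rough illustration of design (2), the sketch below lists which input positions each patch covers along one axis, using the standard convolution output-size formula. The helper names and the specific kernel, stride, and padding values are assumptions chosen for the example; the abstract does not state them.

```python
def patch_windows(length, kernel, stride, padding):
    """Return the (start, stop) index range each patch covers along one axis.

    Indices are in unpadded coordinates, so a negative start or a stop beyond
    `length` means the window reaches into zero padding.
    """
    count = (length + 2 * padding - kernel) // stride + 1
    return [(i * stride - padding, i * stride - padding + kernel) for i in range(count)]


def overlap(windows):
    """Number of positions shared by the first two windows (0 if disjoint)."""
    (_, first_stop), (second_start, _) = windows[0], windows[1]
    return max(0, first_stop - second_start)


# Hypothetical settings: disjoint 4-wide patches vs. 7-wide patches at the same stride.
non_overlapping = patch_windows(224, kernel=4, stride=4, padding=0)
overlapping = patch_windows(224, kernel=7, stride=4, padding=3)

if __name__ == "__main__":
    print(len(non_overlapping), overlap(non_overlapping))
    print(len(overlapping), overlap(overlapping))
```

With these assumed values both settings produce the same number of patches, but in the second each patch shares positions with its neighbor, so information at patch borders is seen by more than one token.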

Code Implementations: 18 repos

Data from Papers with Code (CC-BY-SA-4.0)

