CV · Dec 2, 2025

Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

arXiv:2512.02512v2 · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses image restoration for computer vision applications, presenting an incremental improvement through a two-stage training approach.

The paper tackles Single Image Super-Resolution (SISR) by proposing ViT-SR, a Vision Transformer with a two-stage training strategy involving self-supervised colorization pretraining and residual upsampling, achieving an SSIM of 0.712 and PSNR of 22.90 dB on the DIV2K benchmark.

Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. In the first stage, a self-supervised pretraining phase on a colorization task lets the model learn rich, generalizable visual representations from the data itself. In the second stage, the pretrained model is fine-tuned for 4x super-resolution. Rather than generating the high-resolution image directly, the model predicts a high-frequency residual image that is added to an initial bicubic interpolation, which simplifies the learning task. Trained and evaluated on the DIV2K benchmark dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of the two-stage approach and highlight the potential of self-supervised pretraining for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
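The residual formulation and the PSNR metric in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names are hypothetical, nearest-neighbor upsampling stands in for bicubic interpolation to keep the example short, and the "predicted" residual is a hand-written placeholder for what the ViT would output.

```python
import math

def upsample_nearest(img, scale):
    """Stand-in for bicubic interpolation (assumption): repeat each pixel scale x scale times."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out

def add_residual(base, residual):
    """Final estimate = upsampled base + predicted residual, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, b + r)) for b, r in zip(brow, rrow)]
            for brow, rrow in zip(base, residual)]

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    n = sum(len(row) for row in target)
    mse = sum((p - t) ** 2
              for prow, trow in zip(pred, target)
              for p, t in zip(prow, trow)) / n
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# Toy data: a 1x2 low-resolution image upsampled 4x to a 4x8 base image.
low_res = [[0.2, 0.8]]
base = upsample_nearest(low_res, 4)

# Hypothetical ground truth: the base plus a uniform 0.1 offset.
target = [[v + 0.1 for v in row] for row in base]

# Placeholder residual (assumption): recovers half of the missing detail.
residual = [[0.05 for _ in row] for row in base]
restored = add_residual(base, residual)

print(f"PSNR of base alone:      {psnr(base, target):.2f} dB")
print(f"PSNR with residual added: {psnr(restored, target):.2f} dB")
```

In this toy setting the network only has to account for the difference between the interpolated image and the target, so even a partially correct residual raises PSNR over the interpolation alone.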

