CV | Dec 2, 2025

Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

arXiv:2512.02512v2 | 3.6 | h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses image restoration for computer vision applications, presenting an incremental improvement through a novel training approach.

The paper tackles Single Image Super-Resolution (SISR) by proposing ViT-SR, a Vision Transformer with a two-stage training strategy involving self-supervised colorization pretraining and residual upsampling, achieving an SSIM of 0.712 and PSNR of 22.90 dB on the DIV2K benchmark.

Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. In the first stage, the model learns rich, generalizable visual representations from the data itself through self-supervised pretraining on a colorization task. In the second stage, the pretrained model is fine-tuned for 4x super-resolution. The model predicts a high-frequency residual image that is added to an initial bicubic interpolation; this design simplifies the learning task. Trained and evaluated on the DIV2K benchmark dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of the two-stage approach and highlight the potential of self-supervised pretraining for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
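The residual formulation and the PSNR metric mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`upsample_nearest`, `add_residual`, `psnr`) are hypothetical, images are assumed to be grayscale lists of rows with values in [0, 1], nearest-neighbour upsampling stands in for bicubic interpolation, and an "oracle" residual computed from the ground truth stands in for the residual a trained ViT would predict.

```python
import math


def upsample_nearest(img, scale):
    """Nearest-neighbour upsampling of a 2-D image given as a list of rows.

    Assumption: a simple stand-in for the bicubic interpolation
    described in the abstract.
    """
    return [[px for px in row for _ in range(scale)]
            for row in img for _ in range(scale)]


def add_residual(base, residual):
    """Reconstruct output = base + residual, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, b + r)) for b, r in zip(brow, rrow)]
            for brow, rrow in zip(base, residual)]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    n = sum(len(row) for row in target)
    mse = sum((p - t) ** 2
              for prow, trow in zip(pred, target)
              for p, t in zip(prow, trow)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy 2x2 low-resolution input and a synthetic 8x8 "ground truth".
lr = [[0.2, 0.8], [0.6, 0.4]]
hr = [[(i + j) / 14 for j in range(8)] for i in range(8)]

# Upsampled starting point (4x, matching the scale factor in the abstract).
base = upsample_nearest(lr, 4)

# Oracle residual: what a perfect model would predict (hypothetical).
residual = [[h - b for h, b in zip(hrow, brow)]
            for hrow, brow in zip(hr, base)]

sr = add_residual(base, residual)
print("PSNR of interpolation alone:", psnr(base, hr))
print("PSNR after adding residual: ", psnr(sr, hr))
```

In this framing the network only has to model the difference between the interpolated image and the target, which is mostly high-frequency detail, rather than reproduce the entire image.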
