CV
Sep 22, 2025

Revisiting Vision Language Foundations for No-Reference Image Quality Assessment

arXiv:2509.17374v1
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses no-reference image quality assessment, offering incremental improvements through a systematic evaluation of pretrained vision backbones and a learnable activation selection mechanism.

The paper systematically evaluates six pretrained vision backbones for no-reference image quality assessment. It finds that SigLIP2 consistently performs well and that the choice of activation function strongly affects generalization, with sigmoid outperforming ReLU and GELU on several benchmarks. The authors also introduce a learnable activation selection mechanism that achieves new state-of-the-art SRCC scores on the CLIVE, KADID10K, and AGIQA3K benchmarks.
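For readers unfamiliar with the metric, SRCC (Spearman rank correlation coefficient) measures how well predicted quality scores preserve the ordering of human opinion scores. The sketch below is a minimal standard-library illustration of the metric itself, not code from the paper; the function names `average_ranks`, `pearson`, and `srcc` are hypothetical, and the sample scores in the usage note are made up for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted, opinion_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(opinion_scores))
```

Because only ranks matter, a model whose predictions are monotonically related to the opinion scores reaches the maximum value even if the relationship is nonlinear: `srcc([0.1, 0.5, 0.9], [1, 4, 100])` evaluates to 1.0.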

Large-scale vision-language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for NR-IQA, each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform the commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID10K, and AGIQA3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.
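The abstract does not say how the per-channel activation selection is parameterized. One plausible reading, sketched below purely as an assumption, gives each channel a small vector of learnable logits and mixes the candidate nonlinearities (sigmoid, ReLU, GELU) with softmax weights. The names `CANDIDATES` and `select_activation`, the softmax mixing, and the candidate set are all hypothetical choices, and the sketch covers only the forward pass on plain Python floats.

```python
import math


def sigmoid(x):
    # Numerically stable form: avoids overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x times the standard normal CDF at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Assumed candidate set; the paper may use a different one.
CANDIDATES = (sigmoid, relu, gelu)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_activation(features, logits):
    """Apply a per-channel softmax-weighted mix of candidate activations.

    features: one float per channel.
    logits:   one list of len(CANDIDATES) floats per channel; these would be
              the learnable parameters in a trained model (assumption).
    """
    out = []
    for x, channel_logits in zip(features, logits):
        weights = softmax(channel_logits)
        out.append(sum(w * f(x) for w, f in zip(weights, CANDIDATES)))
    return out
```

Under this parameterization, a channel whose logits strongly favor one candidate behaves almost exactly like that activation, while equal logits give a plain average of the three. In a real model the logits would be trained jointly with the MLP head by an autodiff framework; that part is omitted here.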

