UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
This work addresses the fragmentation of assessment models for image quality and aesthetics with a unified approach that could benefit researchers and practitioners in computer vision, though it appears incremental because it builds on existing multimodal pre-training methods.
The paper tackles the challenge of unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) by proposing UniQA, a vision-language pre-training model that relies on generated text descriptions and a lightweight adapter. It achieves competitive performance across various tasks, including classical IQA and IAA, and the code is publicly available.
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Despite their distinct learning objectives, the two tasks are interconnected because human assessment perception is consistent across them. In this paper, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA) to extract useful and common representations from the two tasks, thereby benefiting them simultaneously. However, the lack of text in the IQA datasets and the textual noise in the IAA datasets pose severe challenges for multimodal pre-training. To address this, we (1) utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) use the generated text for IAA as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. UniQA is highly competitive in various image assessment tasks, including classical IQA and IAA tasks, few-label IQA, and other downstream tasks, showing promise as a foundational assessment model. Code is available at https://github.com/zht8506/UniQA.
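The purification step in (2) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the MLLM-generated description and each noisy IAA comment have already been embedded as vectors, and that purification means keeping the comments most similar to the generated description. The abstract does not specify the actual criterion, so the cosine-similarity ranking, the function names (cosine, purify_comments), and the top_k parameter are all hypothetical and not taken from the UniQA code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def purify_comments(
    reference_emb: Sequence[float],
    comment_embs: Sequence[Sequence[float]],
    comments: Sequence[str],
    top_k: int = 2,
) -> List[str]:
    """Keep the top_k comments whose embeddings are closest to the reference.

    Assumption: reference_emb is the embedding of an MLLM-generated
    description of the image, used as metadata to filter noisy comments.
    """
    scored = [
        (cosine(reference_emb, emb), text)
        for emb, text in zip(comment_embs, comments)
    ]
    # Sort by similarity only, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real text-encoder outputs.
    reference = [1.0, 0.0]
    embeddings = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
    texts = ["nice leading lines", "first!", "soft, balanced lighting"]
    print(purify_comments(reference, embeddings, texts, top_k=2))
```

In practice the embeddings would come from a text encoder, and the threshold or ranking rule would follow whatever the released code implements; the toy vectors here only show the filtering mechanics.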