Hantao Zhou, Longxiang Tang, Rui Yang et al.
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Despite their distinct learning objectives, the two tasks are closely connected, because human judgments of quality and aesthetics rest on consistent perception. In this paper, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA) to extract useful representations common to the two tasks, thereby benefiting both simultaneously. However, the lack of text in IQA datasets and the textual noise in IAA datasets pose severe challenges for multimodal pre-training. To address this, we (1) use multimodal large language models (MLLMs) to generate high-quality text descriptions and (2) use the text generated for IAA as metadata to purify the noisy IAA data. To adapt the pre-trained UniQA effectively to downstream tasks, we further propose a lightweight adapter that uses versatile cues to fully exploit the extensive knowledge of the pre-trained model. UniQA is highly competitive across a range of image assessment tasks, including classical IQA and IAA, few-label IQA, and other downstream tasks, showing promise as a foundational assessment model. Code is available at https://github.com/zht8506/UniQA.
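The purification idea in step (2) can be illustrated with a minimal sketch. The abstract does not specify how the generated text is used to filter noisy IAA comments, so everything below is an assumption made for illustration: the function names (`tokenize`, `jaccard`, `purify_comments`), the `top_k` parameter, and the use of token-overlap (Jaccard) similarity as a stand-in for whatever learned text similarity the actual method may use. The example data is also made up.

```python
import re

def tokenize(text):
    # Lowercase and split into a set of alphabetic word tokens.
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    # Token-overlap similarity in [0, 1]; a crude stand-in (assumption)
    # for a learned text-embedding similarity.
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def purify_comments(reference_caption, comments, top_k=2):
    # Hypothetical filter: rank user comments by similarity to the
    # MLLM-generated caption (used as metadata) and keep the top_k.
    ranked = sorted(
        comments,
        key=lambda c: jaccard(reference_caption, c),
        reverse=True,
    )
    return ranked[:top_k]

# Illustrative data only; not taken from any dataset.
reference = "The photo has balanced composition and soft warm lighting"
comments = [
    "Lovely warm lighting and a balanced composition",
    "congrats on the ribbon!",
    "The lighting is soft but the composition feels cramped",
]
kept = purify_comments(reference, comments, top_k=2)
print(kept)
```

In this toy run the off-topic comment is dropped while the two comments that discuss lighting and composition are retained; a realistic pipeline would presumably replace the token-overlap score with an embedding-based one.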