CV MM | Jul 28, 2023

CLIP Brings Better Features to Visual Aesthetics Learners

Liwu Xu, Jinjin Xu, Yuzhe Yang, Xilu Wang, Yijie Huang, Yaqian Li

arXiv:2307.15640v25.06 citations | h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses the limited data and compute resources available for Image Aesthetics Assessment (IAA) in applications such as image quality evaluation. Its contribution is incremental, since it builds on existing CLIP models.

The paper tackles the challenge of Image Aesthetics Assessment (IAA) by proposing a CLIP-based semi-supervised knowledge distillation method, achieving state-of-the-art performance on multiple benchmarks.

Image Aesthetics Assessment (IAA) is a challenging task due to its subjective nature and expensive manual annotations. Recent large-scale vision-language models, such as Contrastive Language-Image Pre-training (CLIP), have shown promising representation capability for various downstream tasks. However, the application of CLIP to resource-constrained and low-data IAA tasks remains limited. While the few attempts to leverage CLIP in IAA have mainly focused on carefully designed prompts, we extend beyond this by allowing models from different domains and with different model sizes to acquire knowledge from CLIP. To achieve this, we propose a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation (CSKD) paradigm, aiming to learn a lightweight IAA model while leveraging CLIP's strong generalization capability. Specifically, CSKD employs a feature alignment strategy to facilitate distillation between the heterogeneous CLIP teacher and IAA student models, effectively transferring valuable features from pre-trained visual representations to two lightweight IAA models, respectively. To efficiently adapt to downstream IAA tasks in a low-data regime, the two strong visual aesthetics learners then conduct distillation with unlabeled examples, collaboratively refining and transferring the task-specific knowledge. Extensive experiments demonstrate that the proposed CSKD achieves state-of-the-art performance on multiple widely used IAA benchmarks. Furthermore, analysis of attention distance and entropy before and after feature alignment shows the effective transfer of CLIP's feature representation to IAA models, which not only provides valuable guidance for the model initialization of IAA but also enhances the aesthetic feature representation of IAA models. Code will be made publicly available.
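The abstract names two phases, feature alignment and distillation with unlabeled examples, but does not give their loss functions. The following is only a rough sketch of what such objectives could look like, not the authors' implementation. It assumes a linear projection with a mean-squared-error alignment loss for the first phase and a KL divergence between teacher and student score distributions for the second. All function names, the projection matrix, and the choice of losses are hypothetical.

```python
import math

def project(features, weight):
    """Map a student feature vector into the teacher's feature dimension.

    `weight` is a hypothetical projection matrix with one row per teacher dimension.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weight]

def alignment_loss(student_feat, teacher_feat, weight):
    """Phase 1 (assumed form): MSE between projected student and teacher features."""
    projected = project(student_feat, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_feat)) / len(teacher_feat)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits):
    """Phase 2 (assumed form): KL(teacher || student) for one unlabeled image.

    The logits stand in for scores over aesthetic rating bins; no label is needed.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: a 2-dimensional student feature aligned to a 3-dimensional teacher feature.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phase1 = alignment_loss([1.0, 2.0], [1.0, 2.0, 5.0], weight)
phase2 = distillation_loss([0.2, 0.5, 0.3], [0.1, 0.7, 0.2])
```

In this sketch the projection is what lets a student of a different architecture and width be compared with the CLIP teacher. In the second phase the teacher's predictions act as soft targets, which is why unlabeled images can be used.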
