CVMMJul 28, 2023

CLIP Brings Better Features to Visual Aesthetics Learners

arXiv:2307.15640v26 citationsh-index: 16
Originality Incremental advance
AI Analysis

This work addresses the problem of limited data and resources in IAA for applications like image quality evaluation, though it is incremental as it builds on existing CLIP models.

The paper tackles the challenge of Image Aesthetics Assessment (IAA) by proposing a CLIP-based semi-supervised knowledge distillation method, achieving state-of-the-art performance on multiple benchmarks.

Image Aesthetics Assessment (IAA) is a challenging task due to its subjective nature and expensive manual annotations. Recent large-scale vision-language models, such as Contrastive Language-Image Pre-training (CLIP), have shown their promising representation capability for various downstream tasks. However, the application of CLIP to resource-constrained and low-data IAA tasks remains limited. While few attempts to leverage CLIP in IAA have mainly focused on carefully designed prompts, we extend beyond this by allowing models from different domains and with different model sizes to acquire knowledge from CLIP. To achieve this, we propose a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation (CSKD) paradigm, aiming to learn a lightweight IAA model while leveraging CLIP's strong generalization capability. Specifically, CSKD employs a feature alignment strategy to facilitate the distillation of heterogeneous CLIP teacher and IAA student models, effectively transferring valuable features from pre-trained visual representations to two lightweight IAA models, respectively. To efficiently adapt to downstream IAA tasks in a low-data regime, the two strong visual aesthetics learners then conduct distillation with unlabeled examples for refining and transferring the task-specific knowledge collaboratively. Extensive experiments demonstrate that the proposed CSKD achieves state-of-the-art performance on multiple widely used IAA benchmarks. Furthermore, analysis of attention distance and entropy before and after feature alignment shows the effective transfer of CLIP's feature representation to IAA models, which not only provides valuable guidance for the model initialization of IAA but also enhances the aesthetic feature representation of IAA models. Code will be made publicly available.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes