CV · Mar 17, 2025

Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks

arXiv:2503.13260v1 · 3 citations · h-index: 5 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building robust models for subjective visual tasks such as emotion analysis and quality assessment, which matters for applications in human-computer interaction and content evaluation. It is incremental, however, because it adapts an existing model (CLIP) rather than introducing a new paradigm.

The paper tackles poor generalization in visual perceptual tasks, which stems from the scarcity of human-annotated data. It proposes a unified framework that uses CLIP as a prior and reports state-of-the-art results on image memorability prediction, no-reference image quality assessment, and visual emotion analysis, with improved generalization across datasets.

Visual perceptual tasks aim to predict human judgment of images (e.g., emotions evoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets, leading to poor generalization. Typically, specialized models have been designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple perceptual tasks, leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.
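To make "lightweight adaptation" concrete, the sketch below shows one common way such an adaptation can be realized: a frozen pretrained weight plus a small trainable low-rank update, followed by a small output head. The abstract does not state which adaptation mechanism the authors use, so the low-rank (LoRA-style) choice is an assumption, as are the names `LowRankAdaptedLinear` and `TaskHead`, the feature size of 64, the rank, and the number of emotion classes. The random matrix merely stands in for a pretrained CLIP weight; no real CLIP model is loaded.

```python
import numpy as np


class LowRankAdaptedLinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style, assumed).

    Effective weight: W + (alpha / rank) * B @ A. Only A and B would be trained.
    """

    def __init__(self, frozen_weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = frozen_weight.shape
        self.W = frozen_weight                           # pretrained weight, kept frozen
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))               # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


class TaskHead:
    """Small linear head; only its output size differs between tasks."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, (out_dim, in_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, feats):
        return feats @ self.W.T + self.b


# Random stand-in for one pretrained projection inside an image encoder.
rng = np.random.default_rng(42)
frozen = rng.normal(0.0, 0.02, (64, 64))
layer = LowRankAdaptedLinear(frozen, rank=4)

# Same adapted layer design for every task; heads differ only in output size.
memorability_head = TaskHead(64, 1)   # scalar score per image
emotion_head = TaskHead(64, 8)        # class logits (8 is an illustrative count)

x = rng.normal(size=(2, 64))          # stand-in for two images' features
feats = layer(x)
mem_scores = memorability_head(feats)
emo_logits = emotion_head(feats)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so training begins from the pretrained behavior. In this toy setup the adapter adds 512 trainable values next to 4096 frozen ones, which illustrates why such an update suits the small datasets the abstract describes.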
