CV, Aug 18, 2025

WP-CLIP: Leveraging CLIP to Predict Wölfflin's Principles in Visual Art

arXiv:2508.12668v1
h-index: 6
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for automated formal analysis in art history and computational aesthetics. The contribution is incremental: it adapts an existing vision-language model to a specific domain.

The paper tackles the prediction of Wölfflin's five principles of stylistic analysis in visual art, for which no effective metric existed, by fine-tuning CLIP on annotated art datasets. The resulting WP-CLIP model generalizes across diverse artistic styles, as evaluated on GAN-generated paintings and the Pandora-18K dataset.

Wölfflin's five principles offer a structured approach to analyzing stylistic variations for formal analysis. However, no existing metric effectively predicts all five principles in visual art. Computationally evaluating the visual aspects of a painting requires a metric that can interpret key elements such as color, composition, and thematic choices. Recent advancements in vision-language models (VLMs) have demonstrated their ability to evaluate abstract image attributes, making them promising candidates for this task. In this work, we investigate whether CLIP, pre-trained on large-scale data, can understand and predict Wölfflin's principles. Our findings indicate that it does not inherently capture such nuanced stylistic elements. To address this, we fine-tune CLIP on annotated datasets of real art images to predict a score for each principle. We evaluate our model, WP-CLIP, on GAN-generated paintings and the Pandora-18K art dataset, demonstrating its ability to generalize across diverse artistic styles. Our results highlight the potential of VLMs for automated art analysis.
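To make the idea of probing a pre-trained vision-language model concrete, the following is a minimal sketch of a zero-shot, prompt-pair score for each principle. It is not the paper's procedure, and it does not show the fine-tuning step the abstract describes. It assumes that an image embedding and two text embeddings per principle (one for each pole, such as "linear" and "painterly") have already been produced by some CLIP-like encoder. The function names, the prompt wording, and the temperature value are all hypothetical choices made for illustration.

```python
import math

# Wölfflin's five paired concepts. The prompt wording is an illustrative
# assumption, not the phrasing used by the paper.
PRINCIPLES = {
    "linear_vs_painterly": ("linear", "painterly"),
    "planar_vs_recessional": ("planar", "recessional"),
    "closed_vs_open": ("closed form", "open form"),
    "multiplicity_vs_unity": ("multiplicity", "unity"),
    "absolute_vs_relative_clarity": ("absolute clarity", "relative clarity"),
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_score(image_emb, first_emb, second_emb, temperature=0.01):
    """Softmax over the two image-text similarities.

    Returns the weight on the second pole, a value in [0, 1]:
    near 0 means the image sits closer to the first prompt,
    near 1 means it sits closer to the second.
    The temperature default is a hypothetical choice.
    """
    s1 = cosine(image_emb, first_emb) / temperature
    s2 = cosine(image_emb, second_emb) / temperature
    m = max(s1, s2)  # subtract the max for numerical stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e2 / (e1 + e2)


def score_all(image_emb, text_embs, temperature=0.01):
    """Score every principle.

    text_embs maps each principle name to a (first_emb, second_emb) tuple.
    """
    return {
        name: pair_score(image_emb, *text_embs[name], temperature)
        for name in PRINCIPLES
    }


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    image = [1.0, 0.0]
    toy_text = {name: ([1.0, 0.0], [0.0, 1.0]) for name in PRINCIPLES}
    for name, score in score_all(image, toy_text).items():
        print(f"{name}: {score:.3f}")
```

In this toy setup the image embedding coincides with the first prompt of every pair, so each score falls close to 0. With real encoder outputs, scores that cluster near 0.5 or disagree with expert annotations would be consistent with the abstract's finding that pre-trained CLIP does not inherently capture these stylistic distinctions, which is the motivation for fine-tuning it to predict a score per principle.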

