CV · Mar 17, 2024

Training A Small Emotional Vision Language Model for Visual Art Comprehension

Jing Zhang, Liang Zheng, Meng Wang, Dan Guo

arXiv:2403.11150v26.511 citations · h-index: 23 · Has Code · ECCV

Originality: Incremental advance

AI Analysis

This work addresses the trade-off between computational efficiency and model capacity in visual art comprehension. The advance is incremental, since it builds on existing small vision language models.

The paper tackles the problem of enabling small vision language models to understand visual art by identifying emotion categories and generating explanations, achieving competitive performance with large models like LLaVA 7B and GPT4(V) while being trainable on a single RTX 2080 Ti.

This paper develops small vision language models to understand visual art, which, given an artwork, aim to identify its emotion category and explain this prediction in natural language. While small models are computationally efficient, their capacity is much more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived through a VAD dictionary, and add a VAD head to align the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional texts, compared with using traditional text embeddings alone. On the other hand, we design a contrastive head to pull close the embeddings of the image, its emotion class, and its explanation, which aligns model outputs and inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms the state-of-the-art small models but is also competitive with LLaVA 7B after fine-tuning and with GPT4(V). The code is available at https://github.com/BetterZH/SEVLM-code.
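The two alignment ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon `TOY_VAD`, the helper names, the choice of mean squared error for VAD alignment, and the InfoNCE-style contrastive loss are all assumptions made for illustration, since the abstract does not specify the loss functions used.

```python
import math

# Hypothetical toy lexicon: word -> (valence, arousal, dominance), each in [0, 1].
# The scores are made up for illustration, not taken from a real VAD dictionary.
TOY_VAD = {
    "calm": (0.88, 0.10, 0.60),
    "serene": (0.90, 0.12, 0.62),
    "storm": (0.25, 0.85, 0.45),
    "fear": (0.07, 0.84, 0.29),
}
NEUTRAL = (0.5, 0.5, 0.5)  # fallback when no word is in the lexicon


def vad_vector(text):
    """Average the VAD scores of the words that appear in the lexicon."""
    hits = [TOY_VAD[w] for w in text.lower().split() if w in TOY_VAD]
    if not hits:
        return NEUTRAL
    return tuple(sum(axis) / len(hits) for axis in zip(*hits))


def vad_alignment_loss(predicted_text, reference_text):
    """Mean squared error between the VAD vectors of two texts (assumed loss)."""
    p, r = vad_vector(predicted_text), vad_vector(reference_text)
    return sum((a - b) ** 2 for a, b in zip(p, r)) / 3


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchor i should match positive i, not the others."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

In this sketch, `vad_alignment_loss` is zero when the predicted and reference explanations carry the same VAD content and grows as their emotional content diverges. `info_nce` is small when each anchor embedding is closest to its own paired embedding. A three-way alignment of image, emotion class, and explanation could, under the same assumptions, be formed by summing `info_nce` over the pairs of modalities.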
