Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
Impression-CLIP addresses cross-modal retrieval between fonts and impressions, a task relevant to designers and researchers. The contribution is incremental, since it adapts the existing CLIP framework to a specific domain.
The paper tackles the weak and unstable correlation between font shapes and subjective impressions by proposing Impression-CLIP, a CLIP-based model that co-embeds fonts and impressions. In experiments, it achieves better retrieval accuracy than the state-of-the-art method.
Fonts convey different impressions to readers. These impressions often come from the font shapes. However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective. To capture this weak and unstable cross-modal correlation between font shapes and their impressions, we propose Impression-CLIP, a novel machine-learning model based on CLIP (Contrastive Language-Image Pre-training). With the CLIP-based model, font image features and their corresponding impression features are pulled closer together, while font image features and unrelated impression features are pushed apart. This procedure realizes a co-embedding of font images and their impressions. In our experiments, we perform cross-modal retrieval between fonts and impressions through this co-embedding. The results indicate that Impression-CLIP achieves better retrieval accuracy than the state-of-the-art method. Additionally, our model is robust to noise and missing tags.
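The pull-closer / push-apart behavior described above can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive loss, followed by nearest-neighbor retrieval in the shared space. This is an illustration under stated assumptions, not the paper's implementation: the function names, the toy 2-D features, and the temperature value of 0.07 are hypothetical, and the exact loss used by Impression-CLIP may differ from the standard CLIP formulation shown here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(font_feats, imp_feats, temperature=0.07):
    """CLIP-style loss over a batch where font i is paired with impression i.

    Matched pairs sit on the diagonal of the similarity matrix and are
    pulled together; off-diagonal (unrelated) pairs are pushed apart.
    The temperature is an assumed value, not taken from the paper.
    """
    logits = [
        [cosine(f, t) / temperature for t in imp_feats] for f in font_feats
    ]
    logits_t = [list(col) for col in zip(*logits)]  # impression-to-font view
    font_to_imp = cross_entropy_diag(logits)
    imp_to_font = cross_entropy_diag(logits_t)
    return 0.5 * (font_to_imp + imp_to_font)


def retrieve(query, candidates, k=1):
    """Return indices of the k candidates most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])[:k]


# Toy batch: hypothetical 2-D features standing in for encoder outputs.
fonts = [[1.0, 0.0], [0.0, 1.0]]
aligned_impressions = [[1.0, 0.0], [0.0, 1.0]]
swapped_impressions = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(fonts, aligned_impressions)
loss_swapped = symmetric_contrastive_loss(fonts, swapped_impressions)
best_font = retrieve([0.1, 1.0], fonts, k=1)
```

In this toy batch the aligned pairing yields a much lower loss than the swapped pairing, which is the signal training would use to move matched font and impression features together. Retrieval in either direction then reduces to ranking candidates by cosine similarity in the shared space, as `retrieve` does for an impression-like query against the font features.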