Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset
This work addresses the challenge of handling abstract concepts in multimodal datasets for researchers in computer vision and AI, but it is incremental as it builds on existing pretrained models without major methodological breakthroughs.
The authors tackled the problem of learning abstract visual concepts from scarce multimodal data by introducing the Greeting Cards Dataset (GCD) and aggregating features from pretrained image and text embeddings. They demonstrated the dataset's utility for generating greeting card images using a pretrained text-to-image model.
In recent years, there has been a growing number of pre-trained models that are trained on large corpora and perform well on a variety of tasks, such as classifying multimodal datasets. These models have shown good performance on natural images, but they have not been fully explored for abstract concepts that appear in only a small number of images. In this work, we introduce an image/text-based dataset called the Greeting Cards Dataset (GCD), which contains abstract visual concepts. We propose to aggregate features from pretrained image and text embeddings to learn abstract visual concepts from GCD. This allows us to learn text-modified image features, which combine complementary and redundant information from the multi-modal data streams into a single, meaningful feature. In addition, the captions for GCD are computed with a pretrained CLIP-based image captioning model. Finally, we demonstrate that the proposed dataset is also useful for generating greeting card images using a pre-trained text-to-image generation model.
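To make the idea of aggregating image and text embeddings concrete, the following is a minimal Python sketch. It assumes the simplest possible scheme, in which each modality's embedding is L2-normalized and the two are concatenated; the abstract does not specify the aggregation operator, so this is an illustrative assumption rather than the method used for GCD. The function names `l2_normalize` and `aggregate_features` and the toy vectors are hypothetical; in practice the inputs would come from pretrained image and text encoders.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def aggregate_features(image_emb: Sequence[float],
                       text_emb: Sequence[float]) -> List[float]:
    """Hypothetical aggregation: normalize each modality, then concatenate.

    Normalizing first keeps one modality from dominating the fused
    feature simply because its embedding has a larger magnitude.
    """
    return l2_normalize(image_emb) + l2_normalize(text_emb)


# Toy stand-ins for embeddings produced by pretrained encoders (assumed values).
image_emb = [3.0, 4.0]
text_emb = [0.0, 2.0, 0.0]

fused = aggregate_features(image_emb, text_emb)
print(len(fused))  # dimensionality of the fused feature
```

Concatenation preserves both modalities in full and leaves it to a downstream classifier to weigh them; other choices, such as element-wise sums of projected embeddings or attention-based fusion, would fit the same interface.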