CV, AI | Jan 18, 2024

CLIP Model for Images to Textual Prompts Based on Top-k Neighbors

Xin Zhang, Xin Zhang, YeMing Cai, Tianzhi Jia

arXiv:2401.09763v15.22 citations | 2023 3rd International Conference on Electronic Information Engineering and Computer Science (EIECS)

Originality: Synthesis-oriented

AI Analysis

The paper offers a cost-effective solution for image-to-prompt generation in multimodal AI, but it appears incremental because it builds on existing CLIP and KNN techniques.

The paper tackles the problem of generating textual prompts from images without requiring large annotated datasets. It proposes a two-stage method that combines CLIP with KNN and reports a top metric score of 0.612, which is 0.011 to 0.055 higher than the baseline models.

Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method combines the CLIP model with the K-nearest neighbors (KNN) algorithm and is divided into two stages: an offline stage and an online stage. Our method achieves the highest metric, 0.612, among these models, which is 0.013, 0.055, 0.011 higher than CLIP, CLIP + KNN (top 10) respectively.
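The abstract does not spell out how the two stages fit together, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the offline stage stores paired image and prompt embeddings, and the online stage retrieves the top-k most similar stored images and blends their prompt embeddings with similarity weights. The function names (`build_index`, `predict_prompt_embedding`), the weighting scheme, and the tiny hand-written vectors standing in for real CLIP embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_index(image_embeddings, prompt_embeddings):
    """Offline stage (assumed): store unit-length (image, prompt) embedding pairs."""
    return [
        (normalize(img), normalize(txt))
        for img, txt in zip(image_embeddings, prompt_embeddings)
    ]


def predict_prompt_embedding(index, query_embedding, k=2):
    """Online stage (assumed): blend the prompt embeddings of the top-k neighbors."""
    q = normalize(query_embedding)
    # Rank stored images by cosine similarity to the query and keep the top k.
    neighbors = sorted(index, key=lambda pair: dot(q, pair[0]), reverse=True)[:k]
    # Similarity-weighted average; negative similarities are clipped to zero.
    weights = [max(dot(q, img), 0.0) for img, _ in neighbors]
    if sum(weights) == 0:
        weights = [1.0] * len(neighbors)  # fall back to a plain average
    dim = len(neighbors[0][1])
    blended = [
        sum(w * txt[i] for w, (_, txt) in zip(weights, neighbors))
        for i in range(dim)
    ]
    return normalize(blended)


# Toy 3-D vectors standing in for CLIP image and text embeddings.
image_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
prompt_embeddings = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.2], [1.0, 0.0, 0.0]]

index = build_index(image_embeddings, prompt_embeddings)
pred = predict_prompt_embedding(index, [1.0, 0.05, 0.0], k=2)
print(pred)
```

In a real pipeline the stored vectors would come from a CLIP image encoder and text encoder, and the predicted embedding would be scored against the ground-truth prompt embedding, for example with cosine similarity. Whether the paper's 0.612 figure is such a score is not stated in this summary.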
