CV · AI · CL · Nov 24, 2025

From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

arXiv:2511.19149v1
Originality: Incremental advance
AI Analysis

This work addresses the need for automated, visually grounded content generation in the fashion domain, offering a scalable solution for applications like e-commerce and social media, though it is incremental as it builds on existing retrieval-augmented and detection methods.

This paper tackles the problem of generating accurate and stylistically interesting captions and hashtags for fashion images by introducing a retrieval-augmented framework that combines multi-garment detection, attribute reasoning, and LLM prompting. The result is a system that achieves a mean attribute coverage of 0.80 in hashtag generation and reduces hallucination compared to a baseline BLIP model.
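The attribute coverage figure can be made concrete with a small sketch. The summary does not define the metric, so the definition here is an assumption: coverage is the fraction of a sample's evidence attributes that appear, after normalization, inside at least one generated hashtag. The function names `attribute_coverage` and `summarize` are hypothetical, and the paper's actual matching rule may differ.

```python
def _norm(text):
    """Lowercase and keep only letters and digits, so '#RedDress' -> 'reddress'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def attribute_coverage(attributes, hashtags):
    """Assumed metric: share of attributes found as a substring of some hashtag."""
    tags = [_norm(t) for t in hashtags]
    attrs = [_norm(a) for a in attributes if _norm(a)]
    if not attrs:
        return 0.0
    hits = sum(any(a in t for t in tags) for a in attrs)
    return hits / len(attrs)


def summarize(coverages, threshold=0.5):
    """Return (mean coverage, share of samples at or above the threshold)."""
    mean = sum(coverages) / len(coverages)
    share = sum(c >= threshold for c in coverages) / len(coverages)
    return mean, share


# Hypothetical evidence attributes and generated hashtags for one image.
score = attribute_coverage(
    ["red", "cotton", "dress", "women"],
    ["#RedDress", "#CottonStyle", "#OOTD"],
)
print(score)  # 3 of 4 attributes are matched
```

Under this reading, "full coverage at the 50% threshold" would correspond to `summarize` returning a share of 1.0 at `threshold=0.5`, that is, every sample covering at least half of its attributes.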

This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation that combines multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners, which struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module that infers fabric and gender attributes from a structured product index. These attributes, together with retrieved style examples, form a factual evidence pack that guides an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model serves as a supervised baseline for comparison. Experimental results show that the YOLO detector attains a mean Average Precision (mAP@0.5) of 0.71 across nine garment categories. In hashtag generation, the RAG-LLM pipeline produces expressive, attribute-aligned output and achieves a mean attribute coverage of 0.80 with full coverage at the 50% threshold, whereas BLIP yields higher lexical overlap but lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and strong potential for scalable deployment across clothing domains. These results demonstrate that retrieval-augmented generation is an effective and interpretable paradigm for automated, visually grounded fashion content generation.
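The dominant color step can be illustrated with a minimal NumPy sketch. It assumes the input is the RGB pixels of a garment crop, for example taken from a detector bounding box, and it returns the center of the largest k-means cluster. The function name `dominant_color`, the choice of `k`, the RGB color space, and the deterministic farthest-first initialization are all assumptions made for illustration. The abstract does not state which settings the authors use.

```python
import numpy as np


def dominant_color(pixels, k=3, iters=20):
    """Return the center of the largest k-means cluster of RGB pixels.

    pixels: array-like of shape (N, 3) or (H, W, 3).
    Initialization is farthest-first (an illustrative choice, not the paper's).
    """
    pixels = np.asarray(pixels, dtype=np.float64).reshape(-1, 3)

    # Farthest-first initialization: start at the first pixel, then repeatedly
    # add the pixel that is farthest from all centers chosen so far.
    centers = [pixels[0]]
    for _ in range(1, k):
        chosen = np.array(centers)
        d = np.linalg.norm(pixels[:, None, :] - chosen[None, :, :], axis=2)
        centers.append(pixels[d.min(axis=1).argmax()])
    centers = np.array(centers)

    for _ in range(iters):
        # Assignment step: each pixel goes to its nearest center.
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each non-empty cluster's center to its mean.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)

    # Final assignment, then pick the cluster with the most pixels.
    d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    counts = np.bincount(labels, minlength=k)
    return centers[counts.argmax()]


# Synthetic crop: 70 red pixels and 30 blue pixels.
crop = np.array([[200, 30, 30]] * 70 + [[20, 40, 180]] * 30)
print(dominant_color(crop, k=2))
```

In a full pipeline the returned RGB value would still need to be mapped to a color name before it can enter the evidence pack. That mapping is not described in the abstract and is left out here.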

