CV | AI | Apr 24, 2025

FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model

arXiv:2504.17826v1 | 4 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses fashion styling and personalized recommendation for retail users. Its contribution is incremental: it fine-tunes an existing vision-language model on a new dataset.

The paper proposes FashionM3, a multimodal, multitask, and multiround fashion assistant built on a fine-tuned vision-language model. The assistant delivers contextually personalized suggestions and refines them iteratively over multiple rounds of interaction. The authors report superior recommendation effectiveness and practical value.

Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value in the fashion industry. With the advent of vision-language models (VLM), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant, built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3's superior performance in recommendation effectiveness and practical value as a fashion assistant.

