CLAIIR · Oct 22, 2024

Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data

arXiv:2410.17337v2 · 5 citations · h-index: 8 · IJCNLP-AACL
Originality: Incremental advance
AI Analysis

This paper addresses challenges in e-commerce applications by providing a new dataset and framework, but the advance is incremental because it builds on existing multimodal foundation models.

The paper tackles two gaps in e-commerce research: limited multimodal benchmark datasets and limited methods for integrating multimodal information. It introduces MMECInstruct, a large-scale multimodal instruction dataset, and CASLIE, a framework for fine-tuning models. The resulting models substantially outperform advanced baselines in in-domain evaluations and show strong out-of-domain generalizability.

Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, there are significant challenges that hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings. MMECInstruct and CASLIE models are publicly accessible through https://ninglab.github.io/CASLIE/.

