End-to-end multi-modal product matching in fashion e-commerce
This paper addresses product matching for online marketplace and e-commerce companies. It offers a robust solution that reaches near-perfect precision in production when model predictions are combined with a human-in-the-loop process, though the contribution is incremental because it builds on existing pretrained encoders and contrastive learning methods.
The paper tackles product matching in fashion e-commerce with a multi-modal system that projects pretrained image and text encoders and trains them through contrastive learning. The system achieves state-of-the-art results and outperforms both single-modality systems and large pretrained models such as CLIP, while balancing cost and performance.
Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts, and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results while balancing cost and performance. Our solution outperforms single-modality matching systems and large pretrained models, such as CLIP. Furthermore, we show how a human-in-the-loop process can be combined with model-based predictions to achieve near-perfect precision in a production system.
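To make the idea of projecting pretrained embeddings and training them contrastively more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Several details are assumptions made for illustration only: the two modalities are fused by summing their linear projections, the contrastive objective is an InfoNCE-style loss with in-batch negatives, and the human-in-the-loop step is modeled as two hypothetical score thresholds (`accept`, `reject`). The abstract does not specify any of these choices, and all function names are hypothetical.

```python
import math


def project(vec, weights):
    """Apply a linear projection; weights is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def fuse(image_emb, text_emb, w_img, w_txt):
    """Project each modality into a shared space, sum, and normalize.

    Summation is an assumed fusion rule chosen for brevity.
    """
    p_img = project(image_emb, w_img)
    p_txt = project(text_emb, w_txt)
    return l2_normalize([a + b for a, b in zip(p_img, p_txt)])


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: positives[i] matches anchors[i]; every other
    entry in positives acts as an in-batch negative for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def route(score, accept=0.9, reject=0.5):
    """Route a candidate pair by similarity score (thresholds are hypothetical)."""
    if score >= accept:
        return "auto_match"
    if score < reject:
        return "auto_reject"
    return "human_review"
```

In this sketch only the projection matrices would be trained, which is one way to keep training cost low relative to fine-tuning the encoders themselves. Sending the uncertain middle band of scores to human reviewers is one plausible reading of how model predictions and manual checks can be combined to raise precision.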