CV, Mar 3

PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

arXiv:2603.03544v1, 1 citation, h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses efficiency and alignment issues in multimodal systems for Pinterest's recommendation and retrieval, offering incremental improvements with novel architectural elements.

The paper tackles the challenge of integrating visual language models into recommendation and retrieval systems by introducing PinCLIP, a large-scale visual representation learning approach that improves retrieval and ranking at Pinterest. In offline evaluations it outperforms baselines such as Qwen by 20% on multi-modal retrieval tasks, and online A/B tests show engagement gains across Pinterest's major surfaces.

While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, integrating VLMs into recommendation and retrieval systems remains a challenge because of training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that uses a VLM backbone and a hybrid fusion mechanism to capture multi-modal content representations at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that PinCLIP outperforms state-of-the-art baselines, such as Qwen, by 20% in multi-modal retrieval tasks. Online A/B testing demonstrates significant business impact, including substantial engagement gains across all major surfaces in Pinterest. Notably, PinCLIP substantially mitigates the "cold-start" problem, improving fresh content distribution with a 15% increase in Repins for organic content and an 8.7% increase in clicks for new Ads.
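The abstract does not spell out the loss functions, so the following is only a minimal sketch of how an image-text alignment objective and a neighbor alignment objective could be combined. It assumes both are CLIP-style symmetric contrastive (InfoNCE) losses over in-batch negatives, which is a common choice but is not confirmed by the text above. The function names, the temperature of 0.07, and the `neighbor_weight` parameter are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    # Row i of `a` and row i of `b` form a positive pair;
    # every other row in the batch acts as a negative.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def combined_loss(image_emb, text_emb, pin_emb, neighbor_emb, neighbor_weight=1.0):
    # Assumption: the neighbor term pairs each Pin embedding with the embedding
    # of a Pin that shares a board with it, and is added with a scalar weight.
    align = symmetric_info_nce(image_emb, text_emb)
    neighbor = symmetric_info_nce(pin_emb, neighbor_emb)
    return align + neighbor_weight * neighbor
```

In this sketch the same contrastive routine serves both terms; only the definition of a positive pair changes (an image with its own text versus a Pin with a graph neighbor). How the actual system samples neighbors or weights the terms is not stated in the abstract.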

