A Metric for MLLM Alignment in Large-scale Recommendation
This work addresses the problem of costly and inaccurate MLLM alignment evaluation for large-scale recommender systems, though its contribution is incremental, as it builds on existing multimodal recommendation techniques.
The paper tackles the challenge of evaluating multimodal large language model (MLLM) alignment in recommendation systems by proposing the Leakage Impact Score (LIS), a metric that efficiently measures the upper bound of preference data. Online A/B tests on Xiaohongshu's production systems show significant improvements in user time spent and advertiser value.
Multimodal recommendation has emerged as a critical technique in modern recommender systems, leveraging content representations from advanced multimodal large language models (MLLMs). To ensure these representations are well-adapted, alignment with the recommender system is essential. However, evaluating the alignment of MLLMs for recommendation presents significant challenges due to three key issues: (1) static benchmarks are inaccurate because real-world applications are dynamic, (2) evaluations on the online system, while accurate, are prohibitively expensive at scale, and (3) conventional metrics fail to provide actionable insights when learned representations underperform. To address these challenges, we propose the Leakage Impact Score (LIS), a novel metric for multimodal recommendation. Rather than directly assessing MLLMs, LIS efficiently measures the upper bound of preference data. We also share practical insights on deploying MLLMs with LIS in real-world scenarios. Online A/B tests on both the Content Feed and Display Ads of Xiaohongshu's Explore Feed in production demonstrate the effectiveness of our proposed method, showing significant improvements in user time spent and advertiser value.