IR · LG · Sep 5, 2025

Multimodal Foundation Model-Driven User Interest Modeling and Behavior Analysis on Short Video Platforms

arXiv:2509.04751v1 · 2 citations · h-index: 5 · 2025 7th International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI)
Originality: Incremental advance
AI Analysis

The paper addresses the problem of limited user interest modeling in personalized recommendation systems for short video platforms. It is an incremental advance that combines multimodal content data with behavior-driven features.

This paper tackles the challenge of capturing user preferences on short video platforms by proposing a multimodal foundation model-based framework that integrates video, text, and music into a unified semantic space. The authors report significant improvements in behavior prediction accuracy, cold-start user interest modeling, and recommendation click-through rates.

With the rapid expansion of user bases on short video platforms, personalized recommendation systems are playing an increasingly critical role in enhancing user experience and optimizing content distribution. Traditional interest modeling methods often rely on unimodal data, such as click logs or text labels, which limits their ability to fully capture user preferences in a complex multimodal content environment. To address this challenge, this paper proposes a multimodal foundation model-based framework for user interest modeling and behavior analysis. By integrating video frames, textual descriptions, and background music into a unified semantic space using cross-modal alignment strategies, the framework constructs fine-grained user interest vectors. Additionally, we introduce a behavior-driven feature embedding mechanism that incorporates viewing, liking, and commenting sequences to model dynamic interest evolution, thereby improving both the timeliness and accuracy of recommendations. In the experimental phase, we conduct extensive evaluations using both public and proprietary short video datasets, comparing our approach against multiple mainstream recommendation algorithms and modeling techniques. Results demonstrate significant improvements in behavior prediction accuracy, interest modeling for cold-start users, and recommendation click-through rates. Moreover, we incorporate interpretability mechanisms using attention weights and feature visualization to reveal the model's decision basis under multimodal inputs and trace interest shifts, thereby enhancing the transparency and controllability of the recommendation system.
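
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: item embeddings from three modalities are fused in a shared space, and a user interest vector is built as an attention-weighted sum over a behavior sequence. Everything here is an assumption made for illustration and not the authors' method: the function names (`fuse_modalities`, `user_interest`), the simple averaging fusion, the `BEHAVIOR_WEIGHT` values, and the linear recency `decay`. The modality embeddings are assumed to be already projected to a common dimension.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse_modalities(video, text, music):
    """Average L2-normalized modality embeddings that share one dimension.

    Hypothetical stand-in for cross-modal alignment; a real system would
    learn the projection into the shared space.
    """
    parts = [l2_normalize(m) for m in (video, text, music)]
    fused = [sum(vals) / len(parts) for vals in zip(*parts)]
    return l2_normalize(fused)

# Assumed behavior weights: stronger engagement signals score higher.
BEHAVIOR_WEIGHT = {"view": 0.0, "like": 1.0, "comment": 1.5}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def user_interest(events, decay=0.1):
    """Build a user interest vector from a behavior sequence.

    events: list of (item_embedding, behavior, age_in_steps).
    Returns (interest_vector, attention_weights); the weights show how much
    each event contributed, which is one simple form of interpretability.
    """
    scores = [BEHAVIOR_WEIGHT[b] - decay * age for _, b, age in events]
    attn = softmax(scores)
    dim = len(events[0][0])
    vec = [sum(a * emb[i] for a, (emb, _, _) in zip(attn, events))
           for i in range(dim)]
    return vec, attn

# Toy usage with 2-dimensional embeddings.
item_a = fuse_modalities([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
item_b = fuse_modalities([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
vec, attn = user_interest([(item_a, "view", 3), (item_b, "comment", 0)])
```

In this toy run the recent comment on `item_b` receives a larger attention weight than the older view of `item_a`, so the interest vector leans toward `item_b`. Inspecting `attn` over time is one way to trace how the weighting shifts as new behaviors arrive.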

