72.5CLMay 14
Agentic Recommender System with Hierarchical Belief-State MemoryXiang Shen, Yuhang Zhou, Yifan Wu et al.
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that \ours achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.
CVSep 15, 2025Code
SFGNet: Semantic and Frequency Guided Network for Camouflaged Object DetectionDezhen Wang, Haixiang Zhao, Xiang Shen et al.
Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design Multi-Band Fourier Module(MBFM) to enhance the ability of the network in handling complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.
MMMar 21, 2025
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data ExpansionYu Sun, Yin Li, Ruixiao Sun et al.
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system deployed in production systems, leading to significant business gains.
CVSep 25, 2025
Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content GovernanceZixuan Wang, Yu Sun, Hongwei Wang et al.
Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) \textit{Caption}, to enhance the MLLM's perception of video details; (2) \textit{Visual Question Answering (VQA)}, to deepen the MLLM's understanding of issue definitions and annotation guidelines; (3) \textit{Chain-of-Thought (CoT)}, to enhance the MLLM's reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
LGJul 23, 2025
Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content ModerationZixuan Wang, Jinghao Shi, Hanzhong Liang et al.
Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
IRJun 30, 2025
Embedding-based Retrieval in Multimodal Content ModerationHanzhong Liang, Jinghao Shi, Xiang Shen et al.
Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
CLDec 15, 2024
AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMsMingchao Liu, Yu Sun, Ruixiao Sun et al.
The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle with reasoning complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method's scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.