MM · Mar 21, 2025
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
Yu Sun, Yin Li, Ruixiao Sun et al.
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistics-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective at distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially on short-video platforms, yet most pre-trained multimodal architectures focus primarily on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained vision-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency, and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach that integrates audio into VL models. The system has been deployed in production, leading to significant business gains.
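The abstract names the technique only as "kNN-based Latent Space Broadening", so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that a conventional AL step has already picked a few seed items, and that the candidate pool is then broadened with each seed's nearest neighbours in embedding space. The function names, the use of cosine similarity, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def broaden(seed_ids, embeddings, k=1):
    """Return the seeds plus each seed's k nearest neighbours (assumed scheme)."""
    selected = list(seed_ids)
    seen = set(seed_ids)
    for seed in seed_ids:
        # Rank every other item by similarity to the seed, most similar first.
        ranked = sorted(
            (i for i in embeddings if i != seed),
            key=lambda i: cosine(embeddings[seed], embeddings[i]),
            reverse=True,
        )
        for neighbour in ranked[:k]:
            if neighbour not in seen:
                seen.add(neighbour)
                selected.append(neighbour)
    return selected

# Toy latent vectors (hypothetical): "b" lies near "a", and "d" lies near "c".
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
    "e": [-1.0, 0.0],
}

pool = broaden(["a", "c"], embeddings, k=1)
```

Here `pool` contains the two seeds followed by their nearest neighbours, which would then be sent for annotation together. A production system would presumably use an approximate nearest-neighbour index rather than a full sort per seed.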
IR · Dec 14, 2024
USM: Unbiased Survey Modeling for Limiting Negative User Experiences in Recommendation Systems
Chenghui Yu, Peiyi Li, Haoze Wu et al.
Reducing negative user experiences is essential for the success of recommendation platforms. Exposing users to inappropriate content can not only adversely affect users' psychological well-being, but also drive users away from the platform, undermining its long-term success. However, recommendation algorithms tend to weight positive feedback signals more heavily because negative ones are scarce, which may result in valuable negative user feedback being neglected. In this paper, we propose an approach aimed at limiting negative user experiences. Our method primarily relies on distributing in-feed surveys to users, modeling the feedback collected from the surveys, and integrating the model predictions into the recommendation system. We further enhance the baseline survey model by integrating the Learning Hidden Unit Contributions module and the Squeeze-and-Excitation module. In addition, we address the problem of response bias by applying a survey-submit model. The A/B testing results indicate that the survey sexual rate and survey inappropriate rate are reduced by 1.44% to 3.9%. Additionally, we compared our methods against an online baseline that does not incorporate our approach. The results indicate that our approach reduces the report rate and dislike rate by 1% to 2.27% compared to the baseline, confirming the effectiveness of our methods in enhancing user experience. After we launched the survey model based on our approach on our platform, it brought reductions of 1.75%, 2.57%, and 2.06% in reports, dislikes, and survey inappropriate rate, respectively.
CV · Dec 11, 2024
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
Xin Dong, Sen Jia, Ming Rui Wang et al.
With the recent emergence of Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, deploying MLLMs online requires a large amount of GPU resources. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework designed to enhance video quality understanding on the short-video platform while optimizing computational efficiency. Our approach integrates an entropy-based pre-filtering stage, where a lightweight model assesses uncertainty and selectively filters cases before passing them to the more computationally intensive MLLM for final evaluation. By prioritizing high-uncertainty samples for deeper analysis, our framework significantly reduces GPU usage while maintaining the strong classification performance of a full MLLM deployment. To demonstrate the effectiveness of COEF-VQ, we deploy this new framework onto the video management platform (VMP) at the short-video platform, and perform a series of detailed experiments on two in-house tasks related to video quality understanding. We show that COEF-VQ leads to substantial performance gains in offline evaluation on these two tasks and effectively enhances platform safety with limited resource consumption, reducing the inappropriate content video view rate by 9.9% in an online A/B test without affecting engagement. Post-launch monitoring confirmed sustained improvements, validating its real-world impact.
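The entropy-based pre-filtering stage described above can be illustrated with a small sketch. This is not the COEF-VQ implementation: the threshold value, the function names, and the stand-in heavy model are assumptions made for illustration. The lightweight model's class probabilities are scored by Shannon entropy, and only high-entropy (uncertain) cases are passed on to the expensive model.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_predict(light_probs, heavy_model, item, threshold=0.5):
    """Route one item through a two-stage cascade.

    light_probs: class probabilities from the lightweight model.
    heavy_model: callable standing in for the MLLM (hypothetical).
    threshold:   entropy cut-off; 0.5 is an arbitrary illustrative value.
    """
    if entropy(light_probs) < threshold:
        # Confident case: accept the lightweight model's argmax label.
        label = max(range(len(light_probs)), key=light_probs.__getitem__)
        return label, "light"
    # Uncertain case: defer to the expensive model.
    return heavy_model(item), "heavy"

def fake_mllm(item):
    """Placeholder for the expensive model; always predicts class 1."""
    return 1

confident = cascade_predict([0.98, 0.02], fake_mllm, "video-1")
uncertain = cascade_predict([0.55, 0.45], fake_mllm, "video-2")
```

In this sketch the first call is resolved by the lightweight model and the second is escalated. The fraction of traffic that reaches the heavy model, and therefore the GPU cost, is controlled by the threshold.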
CV · Sep 25, 2025
Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
Zixuan Wang, Yu Sun, Hongwei Wang et al.
Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM's perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM's understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM's reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
LG · Jul 23, 2025
Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content Moderation
Zixuan Wang, Jinghao Shi, Hanzhong Liang et al.
Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
CL · Dec 15, 2024
AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
Mingchao Liu, Yu Sun, Ruixiao Sun et al.
The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle to reason over complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method's scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.
CV · Dec 16, 2018
Efficient Super Resolution Using Binarized Neural Network
Yinglan Ma, Hongyu Xiong, Zhe Hu et al.
Deep convolutional neural networks (DCNNs) have recently demonstrated high-quality results in single-image super-resolution (SR). DCNNs often suffer from over-parametrization and large amounts of redundancy, which results in inefficient inference and high memory usage, preventing widespread application on mobile devices. As a way to significantly reduce model size and computation time, binarized neural networks have so far been shown to excel only on semantic-level tasks such as image classification and recognition. Little effort has been spent on network quantization for image enhancement tasks like SR, as quantization is usually assumed to sacrifice pixel-level accuracy. In this work, we explore a network-binarization approach for SR tasks without sacrificing much reconstruction accuracy. To achieve this, we binarize the convolutional filters in the residual blocks only, and adopt a learnable weight for each binary filter. We evaluate this idea on several state-of-the-art DCNN-based architectures, and show that binarized SR networks achieve qualitative and quantitative results comparable to those of their real-weight counterparts. Moreover, the proposed binarization strategy can reduce model size by 80% when applied to SRResNet, and could potentially speed up inference by 5 times.
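As a rough illustration of binarizing a filter with a per-filter scale, the sketch below replaces each weight by its sign and multiplies the result by a single scalar `alpha`. The abstract states only that the weight is learnable. Initializing `alpha` to the mean absolute weight is an assumption borrowed from common binarization practice, and the function names are hypothetical.

```python
def sign(x):
    """Map a real weight to +1.0 or -1.0 (zero is treated as +1.0)."""
    return 1.0 if x >= 0 else -1.0

def binarize_filter(weights, alpha=None):
    """Approximate a 2-D filter as alpha * sign(weights).

    weights: list of rows of real-valued weights.
    alpha:   per-filter scale; in training this would be a learnable
             parameter. If omitted, it is initialized to mean(|w|),
             which is an assumed choice, not taken from the source.
    """
    flat = [w for row in weights for w in row]
    if alpha is None:
        alpha = sum(abs(w) for w in flat) / len(flat)
    binary = [[alpha * sign(w) for w in row] for row in weights]
    return binary, alpha

real_filter = [[0.5, -0.25],
               [-1.0, 0.25]]

binary_filter, alpha = binarize_filter(real_filter)
```

Each binarized filter needs one bit per weight plus one real-valued scale, which is where the storage saving comes from. Filters outside the residual blocks would keep their full-precision weights under the scheme the abstract describes.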
CL · Dec 4, 2018
Transferable Natural Language Interface to Structured Queries aided by Adversarial Generation
Hongyu Xiong, Ruixiao Sun
A natural language interface (NLI) to structured queries is appealing because of its wide industrial applications and high economic value. In this work, we tackle the problem of domain adaptation for NLI with limited data in the target domain. Two important approaches are considered: (a) effective general-knowledge learning on source-domain semantic parsing, and (b) data augmentation in the target domain. We present a Structured Query Inference Network (SQIN) to enhance learning for domain adaptation by separating schema information from NL and decoding SQL in a more structure-aware manner; we also propose a GAN-based augmentation technique (AugmentGAN) to mitigate the lack of target-domain data. We report solid results on GeoQuery, Overnight, and WikiSQL to demonstrate state-of-the-art performance on both in-domain and domain-transfer tasks.
AI · Sep 7, 2018
A Transfer-Learnable Natural Language Interface for Databases
Wenlu Wang, Yingtao Tian, Hongyu Xiong et al.
Relational database management systems (RDBMSs) are powerful because they are able to optimize and answer queries against any relational database. A natural language interface (NLI) for a database, on the other hand, is tailored to support that specific database. In this work, we introduce a general-purpose transfer-learnable NLI with the goal of learning one model that can be used as an NLI for any relational database. We adopt the data management principle of separating data and its schema, with additional support for the idiosyncrasy and complexity of natural languages. Specifically, we introduce an automatic annotation mechanism that separates the schema and the data, where the schema also covers knowledge about natural language. Furthermore, we propose a customized sequence model that translates annotated natural language queries to SQL statements. We show in experiments that our approach outperforms previous NLI methods on the WikiSQL dataset, and that the learned model can be applied to another benchmark dataset, OVERNIGHT, without retraining.