Chuyuan Wang

CV
h-index14
5papers
91citations
Novelty57%
AI Score39

5 Papers

CVNov 2, 2023Code
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Yifan Du, Hangyu Guo, Kun Zhou et al.

Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). In this paper, we aim to investigate a fundamental question: ''what makes for good visual instructions''. Through a comprehensive empirical study, we find that instructions focusing on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs, with results correlating to instruction complexity. Based on this insight, we develop a systematic approach to automatically create high-quality complex visual reasoning instructions. Our approach employs a synthesize-complicate-reformulate paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the ComVint dataset with 32K examples, and fine-tune four MLLMs on it. Experimental results consistently demonstrate the enhanced performance of all compared MLLMs, such as a 27.86% and 27.60% improvement for LLaVA on MME-Perception and MME-Cognition, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.

LGJul 17, 2024Code
Conditional Quantile Estimation for Uncertain Watch Time in Short-Video Recommendation

Chengzhi Lin, Shuchang Liu, Chuyuan Wang et al.

Accurately predicting watch time is crucial for optimizing recommendations and user experience in short video platforms. However, existing methods that estimate a single average watch time often fail to capture the inherent uncertainty in user engagement patterns. In this paper, we propose Conditional Quantile Estimation (CQE) to model the entire conditional distribution of watch time. Using quantile regression, CQE characterizes the complex watch-time distribution for each user-video pair, providing a flexible and comprehensive approach to understanding user behavior. We further design multiple strategies to combine the quantile estimates, adapting to different recommendation scenarios and user preferences. Extensive offline experiments and online A/B tests demonstrate the superiority of CQE in watch-time prediction and user engagement modeling. Specifically, deploying CQE online on a large-scale platform with hundreds of millions of daily active users has led to substantial gains in key evaluation metrics, including active days, engagement time, and video views. These results highlight the practical impact of our proposed approach in enhancing the user experience and overall performance of the short video recommendation system. The code will be released https://github.com/justopit/CQE.

CLAug 23, 2023
PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

Chenrui Zhang, Lin Liu, Jinpeng Wang et al.

As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Pompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.

IRSep 15, 2024
Dreaming User Multimodal Representation Guided by The Platonic Representation Hypothesis for Micro-Video Recommendation

Chengzhi Lin, Hezheng Lin, Shuchang Liu et al.

The proliferation of online micro-video platforms has underscored the necessity for advanced recommender systems to mitigate information overload and deliver tailored content. Despite advancements, accurately and promptly capturing dynamic user interests remains a formidable challenge. Inspired by the Platonic Representation Hypothesis, which posits that different data modalities converge towards a shared statistical model of reality, we introduce DreamUMM (Dreaming User Multi-Modal Representation), a novel approach leveraging user historical behaviors to create real-time user representation in a multimoda space. DreamUMM employs a closed-form solution correlating user video preferences with multimodal similarity, hypothesizing that user interests can be effectively represented in a unified multimodal space. Additionally, we propose Candidate-DreamUMM for scenarios lacking recent user behavior data, inferring interests from candidate videos alone. Extensive online A/B tests demonstrate significant improvements in user engagement metrics, including active days and play count. The successful deployment of DreamUMM in two micro-video platforms with hundreds of millions of daily active users, illustrates its practical efficacy and scalability in personalized micro-video content delivery. Our work contributes to the ongoing exploration of representational convergence by providing empirical evidence supporting the potential for user interest representations to reside in a multimodal space.

CVJun 18, 2025
Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification

Junhao Wu, Aboagye-Ntow Stephen, Chuyuan Wang et al.

Ultra-high Spatial Resolution (UHSR) Land Cover Classification is increasingly important for urban analysis, enabling fine-scale planning, ecological monitoring, and infrastructure management. It identifies land cover types on sub-meter remote sensing imagery, capturing details such as building outlines, road networks, and distinct boundaries. However, most existing methods focus on 1 m imagery and rely heavily on large-scale annotations, while UHSR data remain scarce and difficult to annotate, limiting practical applicability. To address these challenges, we introduce Baltimore Atlas, a UHSR land cover classification framework that reduces reliance on large-scale training data and delivers high-accuracy results. Baltimore Atlas builds on three key ideas: (1) Baltimore Atlas Dataset, a 0.3 m resolution dataset based on aerial imagery of Baltimore City; (2) FreqWeaver Adapter, a parameter-efficient adapter that transfers SAM2 to this domain, leveraging foundation model knowledge to reduce training data needs while enabling fine-grained detail and structural modeling; (3) Uncertainty-Aware Teacher Student Framework, a semi-supervised framework that exploits unlabeled data to further reduce training dependence and improve generalization across diverse scenes. Using only 5.96% of total model parameters, our approach achieves a 1.78% IoU improvement over existing parameter-efficient tuning strategies and a 3.44% IoU gain compared to state-of-the-art high-resolution remote sensing segmentation methods on the Baltimore Atlas Dataset.