Fan Du

HC
h-index: 3
21 papers
437 citations
Novelty: 50%
AI Score: 57

21 Papers

97.8 CV May 30 Code
Towards Sparse Video Understanding and Reasoning

Chenwei Xu, Zhen Ye, Shang Wu et al.

We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
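As an illustration of the confidence-gain term only, the following is a minimal sketch and not the authors' implementation. It assumes the "log-odds gap" can be read as the difference in log-probability between the correct option and the strongest alternative; the function names and the `eps` clipping are hypothetical.

```python
import math

def log_odds_gap(probs, correct, eps=1e-12):
    """Log-probability margin between the correct option and the best alternative.

    Assumption: this margin stands in for the abstract's "log-odds gap".
    """
    best_other = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(max(probs[correct], eps)) - math.log(max(best_other, eps))

def confidence_gain(probs_before, probs_after, correct):
    """Reward the increase in the margin after new frames are added."""
    return log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)

# Hypothetical option probabilities before and after a round of frame selection.
gain = confidence_gain([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1], correct=0)
```

A positive `gain` means the newly added frames moved the model toward the correct option; a negative value means they moved it away.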

SI Apr 5, 2022
CGC: Contrastive Graph Clustering for Community Detection and Tracking

Namyong Park, Ryan Rossi, Eunyee Koh et al.

Given entities and their interactions in web data, which may have occurred at different times, how can we find communities of entities and track their evolution? In this paper, we approach this important task from a graph clustering perspective. Recently, state-of-the-art clustering performance in various domains has been achieved by deep clustering methods. In particular, deep graph clustering (DGC) methods have successfully extended deep clustering to graph-structured data by learning node representations and cluster assignments in a joint optimization framework. Despite some differences in modeling choices (e.g., encoder architectures), existing DGC methods are mainly based on autoencoders and use the same clustering objective with relatively minor adaptations. Also, while many real-world graphs are dynamic, previous DGC methods considered only static graphs. In this work, we develop CGC, a novel end-to-end framework for graph clustering, which fundamentally differs from existing methods. CGC learns node embeddings and cluster assignments in a contrastive graph learning framework, where positive and negative samples are carefully selected in a multi-level scheme such that they reflect hierarchical community structures and network homophily. Also, we extend CGC to time-evolving data, where temporal graph clustering is performed in an incremental learning fashion, with the ability to detect change points. Extensive evaluation on real-world graphs demonstrates that the proposed CGC consistently outperforms existing methods.

IR Jul 26, 2022
Bundle MCR: Towards Conversational Bundle Recommendation

Zhankui He, Handong Zhao, Tong Yu et al.

Bundle recommender systems recommend sets of items (e.g., pants, shirt, and shoes) to users, but they often suffer from two issues: significant interaction sparsity and a large output space. In this work, we extend multi-round conversational recommendation (MCR) to alleviate these issues. MCR, which uses a conversational paradigm to elicit user interests by asking user preferences on tags (e.g., categories or attributes) and handling user feedback across multiple rounds, is an emerging recommendation setting to acquire user feedback and narrow down the output space, but has not been explored in the context of bundle recommendation. In this work, we propose a novel recommendation task named Bundle MCR. We first propose a new framework to formulate Bundle MCR as Markov Decision Processes (MDPs) with multiple agents, for user modeling, consultation and feedback handling in bundle contexts. Under this framework, we propose a model architecture, called Bundle Bert (Bunt) to (1) recommend items, (2) post questions and (3) manage conversations based on bundle-aware conversation states. Moreover, to train Bunt effectively, we propose a two-stage training strategy. In an offline pre-training stage, Bunt is trained using multiple cloze tasks to mimic bundle interactions in conversations. Then in an online fine-tuning stage, Bunt agents are enhanced by user interactions. Our experiments on multiple offline datasets as well as the human evaluation show the value of extending MCR frameworks to bundle settings and the effectiveness of our Bunt design.

LG Dec 28, 2022
PersonaSAGE: A Multi-Persona Graph Neural Network

Gautam Choudhary, Iftikhar Ahamath Burhanuddin, Eunyee Koh et al.

Graph Neural Networks (GNNs) have become increasingly important in recent years due to their state-of-the-art performance on many important downstream applications. Existing GNNs have mostly focused on learning a single node representation, even though a node often exhibits polysemous behavior in different contexts. In this work, we develop a persona-based graph neural network framework called PersonaSAGE that learns multiple persona-based embeddings for each node in the graph. Such disentangled representations are more interpretable and useful than a single embedding. Furthermore, PersonaSAGE learns the appropriate set of persona embeddings for each node in the graph, and every node can have a different number of assigned persona embeddings. The framework is flexible, and its general design makes the learned embeddings widely applicable across domains. We evaluate our approach against a variety of baselines on publicly available benchmark datasets. The experiments demonstrate the effectiveness of PersonaSAGE for a variety of important tasks, including link prediction, where we achieve an average gain of 15% while remaining competitive for node classification. Finally, we also demonstrate the utility of PersonaSAGE with a case study for personalized recommendation of different entity types in a data management platform.

CV Mar 3
PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Shang Wu, Chenwei Xu, Zhuofan Xia et al.

State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8% to 66.8%) while simultaneously increasing semantic adherence by 4.4pp (43.4% to 47.8%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (+2.2%, 100× larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
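The reward curriculum described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's schedule: the linear ramp, the `start` and `end` weights, and all function names are hypothetical. The group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO normalization.

```python
from statistics import mean, pstdev

def curriculum_weight(step, total_steps, start=0.2, end=0.8):
    """Weight on the physics reward; ramps linearly from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def combined_reward(semantic, physical, step, total_steps):
    """Blend the two scores; early steps favor semantic fidelity, later steps favor physics."""
    w = curriculum_weight(step, total_steps)
    return (1.0 - w) * semantic + w * physical

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled prompt's reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical (semantic, physical) scores for three refined prompts at mid-training.
scores = [(0.9, 0.3), (0.6, 0.7), (0.4, 0.9)]
rewards = [combined_reward(s, p, step=50, total_steps=100) for s, p in scores]
advantages = group_relative_advantages(rewards)
```

Early in training the first prompt (high semantic score) would be favored; late in training the third (high physics score) would be.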

90.1 CV Apr 27 Code
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Fan Du, Feng Yan, Jianxiong Wu et al.

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 π0.5 baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and π0.5 by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.

CV Mar 3
Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Haoran Lu, Shang Wu, Jianshu Zhang et al.

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

CV Dec 26, 2025
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

Divyansh Srivastava, Akshay Mehra, Pranav Maneriker et al.

Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to the generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on ImageNet at 256 and 384 generation resolutions, respectively, leading to a reduction of up to 40% in training FLOPs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
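To make the entropy criterion concrete, here is a minimal sketch of entropy-driven patching. It is an assumption-laden illustration rather than DPAR's actual rule: the thresholding scheme, the `max_patch` cap, and the function names are all hypothetical, and the per-token entropies are assumed to come from some lightweight autoregressive model.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_patches(entropies, threshold, max_patch=4):
    """Group token indices into patches.

    A new patch starts whenever a token's entropy exceeds `threshold`
    (a hard-to-predict token) or the current patch has reached `max_patch` tokens.
    """
    patches, current = [], []
    for i, h in enumerate(entropies):
        if current and (h > threshold or len(current) >= max_patch):
            patches.append(current)
            current = []
        current.append(i)
    if current:
        patches.append(current)
    return patches

# Hypothetical per-token entropies: one surprising token at index 2.
patches = entropy_patches([0.1, 0.2, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1], threshold=1.0)
```

Low-entropy runs are merged into larger patches, so fewer sequence positions are spent on predictable regions.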

CV Sep 6, 2025 Code
PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters

Zijian Chen, Wenjie Hua, Jinhao Li et al.

Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current decipherment methodologies for OBCs are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively, and that most of the time they are limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.

CV Nov 23, 2025
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Xiyang Wu, Zongxia Li, Jihui Jin et al.

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.

HC Sep 6, 2021
An Evaluation-Focused Framework for Visualization Recommendation Algorithms

Zehua Zeng, Phoebe Moh, Fan Du et al.

Although we have seen a proliferation of algorithms for recommending visualizations, these algorithms are rarely compared with one another, making it difficult to ascertain which algorithm is best for a given visual analysis scenario. Though several formal frameworks have been proposed in response, we believe this issue persists because visualization recommendation algorithms are inadequately specified from an evaluation perspective. In this paper, we propose an evaluation-focused framework to contextualize and compare a broad range of visualization recommendation algorithms. We present the structure of our framework, where algorithms are specified using three components: (1) a graph representing the full space of possible visualization designs, (2) the method used to traverse the graph for potential candidates for recommendation, and (3) an oracle used to rank candidate designs. To demonstrate how our framework guides the formal comparison of algorithmic performance, we not only theoretically compare five existing representative recommendation algorithms, but also empirically compare four new algorithms generated based on our findings from the theoretical comparison. Our results show that these algorithms behave similarly in terms of user performance, highlighting the need for more rigorous formal comparisons of recommendation algorithms to further clarify their benefits in various analysis scenarios.
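The three components named above (design-space graph, traversal method, ranking oracle) can be pictured with a toy sketch. Everything here is hypothetical and for illustration only: the field names, the breadth-first traversal, and the scoring rule are assumptions, not any of the algorithms compared in the paper.

```python
from collections import deque

def recommend(start, neighbors, oracle, max_depth, top_k):
    """Traverse the design graph breadth-first, then rank candidates with the oracle."""
    seen = {start}
    queue = deque([(start, 0)])
    candidates = []
    while queue:
        design, depth = queue.popleft()
        candidates.append(design)
        if depth == max_depth:
            continue
        for nxt in neighbors(design):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return sorted(candidates, key=oracle, reverse=True)[:top_k]

# (1) Graph: a design is a tuple of encoded fields; an edge adds one field.
FIELDS = ("price", "region", "year")

def add_one_field(design):
    return [tuple(sorted(design + (f,))) for f in FIELDS if f not in design]

# (3) Oracle: a toy score that prefers designs encoding exactly two fields.
def toy_oracle(design):
    return -abs(len(design) - 2)

# (2) Traversal: breadth-first from the empty design, two steps deep.
top = recommend((), add_one_field, toy_oracle, max_depth=2, top_k=3)
```

Swapping any one of the three pieces (a different graph, a different traversal, or a different oracle) yields a different recommendation algorithm, which is what makes the components useful as axes of comparison.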

HC Aug 4, 2021
VBridge: Connecting the Dots Between Features and Data to Explain Healthcare Models

Furui Cheng, Dongyu Liu, Fan Du et al.

Machine learning (ML) is increasingly applied to Electronic Health Records (EHRs) to solve clinical prediction tasks. Although many ML models perform promisingly, issues with model transparency and interpretability limit their adoption in clinical practice. Directly using existing explainable ML techniques in clinical settings can be challenging. Through literature surveys and collaborations with six clinicians with an average of 17 years of clinical experience, we identified three key challenges: clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence. Following an iterative design process, we further designed and developed VBridge, a visual analytics tool that seamlessly incorporates ML explanations into clinicians' decision-making workflow. The system includes a novel hierarchical display of contribution-based feature explanations and enriched interactions that connect the dots between ML features, explanations, and data. We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians, showing that visually associating model explanations with patients' situational records can help clinicians better interpret and use model predictions when making clinical decisions. We further derived a list of design implications for developing future explainable ML tools to support clinical decision-making.

HC Mar 21, 2021
Insight-centric Visualization Recommendation

Camille Harris, Ryan A. Rossi, Sana Malik et al.

Visualization recommendation systems simplify exploratory data analysis (EDA) and make understanding data more accessible to users of all skill levels by automatically generating visualizations for users to explore. However, most existing visualization recommendation systems focus on ranking all visualizations into a single list or set of groups based on particular attributes or encodings. This global ranking makes it difficult and time-consuming for users to find the most interesting or relevant insights. To address these limitations, we introduce a novel class of visualization recommendation systems that automatically rank and recommend both groups of related insights as well as the most important insights within each group. Our proposed approach combines results from many different learning-based methods to discover insights automatically. A key advantage is that this approach generalizes to a wide variety of attribute types such as categorical, numerical, and temporal, as well as complex non-trivial combinations of these different attribute types. To evaluate the effectiveness of our approach, we implemented a new insight-centric visualization recommendation system, SpotLight, which generates and ranks annotated visualizations to explain each insight. We conducted a user study with 12 participants and two datasets which showed that users are able to quickly understand and find relevant insights in unfamiliar data.

HC Mar 6, 2021
ChartStory: Automated Partitioning, Layout, and Captioning of Charts into Comic-Style Narratives

Jian Zhao, Shenyu Xu, Senthil Chandrasegaran et al.

Visual data storytelling is gaining importance as a means of presenting data-driven information or analysis results, especially to the general public. This has resulted in design principles being proposed for data-driven storytelling, and new authoring tools being created to aid such storytelling. However, data analysts typically lack sufficient background in design and storytelling to make effective use of these principles and authoring tools. To assist this process, we present ChartStory for crafting data stories from a collection of user-created charts, using a style akin to comic panels to imply the underlying sequence and logic of data-driven narratives. Our approach is to operationalize established design principles into an advanced pipeline which characterizes charts by their properties and similarity, and recommends ways to partition, lay out, and caption story pieces to serve a narrative. ChartStory also augments this pipeline with intuitive user interactions for visual refinement of generated data comics. We extensively and holistically evaluate ChartStory via a trio of studies. We first assess how the tool supports data comic creation in comparison to a manual baseline tool. Data comics from this study are subsequently compared with ChartStory's automated recommendations and evaluated by a team of narrative visualization practitioners. This is followed by a pair of interview studies with data scientists using their own datasets and charts who provide an additional assessment of the system. We find that ChartStory provides cogent recommendations for narrative generation, resulting in data comics that compare favorably to manually-created ones.

IR Feb 12, 2021
Personalized Visualization Recommendation

Xin Qian, Ryan A. Rossi, Fan Du et al.

Visualization recommendation work has focused solely on scoring visualizations based on the underlying dataset and not the actual user and their past visualization feedback. These systems recommend the same visualizations for every user, even though users' interests, intent, and visualization preferences are vitally important and likely to be fundamentally different. In this work, we formally introduce the problem of personalized visualization recommendation and present a generic learning framework for solving it. In particular, we focus on recommending visualizations personalized for each individual user based on their past visualization interactions (e.g., viewed, clicked, manually created) along with the data from those visualizations. More importantly, the framework can learn from visualizations relevant to other users, even if the visualizations are generated from completely different datasets. Experiments demonstrate the effectiveness of the approach as it leads to higher quality visualization recommendations tailored to the specific user intent and preferences. To support research on this new problem, we release our user-centric visualization corpus consisting of 17.4k users exploring 94k datasets with 2.3 million attributes and 32k user-generated visualizations.

HC Feb 3, 2021
InfoColorizer: Interactive Recommendation of Color Palettes for Infographics

Lin-Ping Yuan, Ziqi Zhou, Jian Zhao et al.

When designing infographics, general users usually struggle with getting desired color palettes using existing infographic authoring tools, which sometimes sacrifice customizability, require design expertise, or neglect the influence of elements' spatial arrangement. We propose a data-driven method that provides flexibility by considering users' preferences, lowers the expertise barrier via automation, and tailors suggested palettes to the spatial layout of elements. We build a recommendation engine by utilizing deep learning techniques to characterize good color design practices from data, and further develop InfoColorizer, a tool that allows users to obtain color palettes for their infographics in an interactive and dynamic manner. To validate our method, we conducted a comprehensive four-part evaluation, including case studies, a controlled user study, a survey study, and an interview study. The results indicate that InfoColorizer can provide compelling palette recommendations with adequate flexibility, allowing users to effectively obtain high-quality color design for input infographics with low effort.

IR Sep 25, 2020
ML-based Visualization Recommendation: Learning to Recommend Visualizations from Data

Xin Qian, Ryan A. Rossi, Fan Du et al.

Visualization recommendation seeks to generate, score, and recommend useful visualizations to users automatically, and is fundamentally important for exploring and gaining insights into a new or existing dataset quickly. In this work, we propose the first end-to-end ML-based visualization recommendation system that takes as input a large corpus of datasets and visualizations and learns a model based on this data. Then, given a new unseen dataset from an arbitrary user, the model automatically generates visualizations for that new dataset, derives scores for the visualizations, and outputs a list of recommended visualizations to the user ordered by effectiveness. We also describe an evaluation framework to quantitatively evaluate visualization recommendation models learned from a large corpus of visualizations and datasets. Through quantitative experiments, a user study, and qualitative analysis, we show that our end-to-end ML-based system recommends more effective and useful visualizations compared to existing state-of-the-art rule-based systems. Finally, we observed a strong preference by the human experts in our user study towards the visualizations recommended by our ML-based system as opposed to the rule-based system (5.92 on a 7-point Likert scale compared to only 3.45).

HC Sep 5, 2020
A Visual Analytics Approach for Exploratory Causal Analysis: Exploration, Validation, and Applications

Xiao Xie, Fan Du, Yingcai Wu

Using causal relations to guide decision making has become an essential analytical task across various domains, from marketing and medicine to education and social science. While powerful statistical models have been developed for inferring causal relations from data, domain practitioners still lack effective visual interfaces for interpreting the causal relations and applying them in their decision-making process. Through interview studies with domain experts, we characterize their current decision-making workflows, challenges, and needs. Through an iterative design process, we developed a visualization tool that allows analysts to explore, validate, and apply causal relations in real-world decision-making scenarios. The tool provides an uncertainty-aware causal graph visualization for presenting a large set of causal relations inferred from high-dimensional data. On top of the causal graph, it supports a set of intuitive user controls for performing what-if analyses and making action plans. We report on two case studies in marketing and student advising to demonstrate that users can effectively explore causal relations and design action plans for reaching their goals.

HC Apr 26, 2020
The Impact of Presentation Style on Human-In-The-Loop Detection of Algorithmic Bias

Po-Ming Law, Sana Malik, Fan Du et al.

While decision makers have begun to employ machine learning, machine learning models may make predictions that are biased against certain demographic groups. Semi-automated bias detection tools often present reports of automatically-detected biases using a recommendation list or visual cues. However, there is a lack of guidance concerning which presentation style to use in which scenarios. We conducted a small lab study with 16 participants to investigate how presentation style might affect user behaviors in reviewing bias reports. Participants used both a prototype with a recommendation list and a prototype with visual cues for bias detection. We found that participants often wanted to investigate the performance measures that were not automatically detected as biases. Yet, when using the prototype with a recommendation list, they tended to give less consideration to such measures. Grounded in the findings, we propose information load and comprehensiveness as two axes for characterizing bias detection tasks and illustrate how the two axes could be adopted to reason about when to use a recommendation list or visual cues.

CY Mar 13, 2020
Designing Tools for Semi-Automated Detection of Machine Learning Biases: An Interview Study

Po-Ming Law, Sana Malik, Fan Du et al.

Machine learning models often make predictions that are biased against certain subgroups of input data. When undetected, machine learning biases can have significant financial and ethical implications. Semi-automated tools that involve humans in the loop could facilitate bias detection. Yet, little is known about the considerations involved in their design. In this paper, we report on an interview study with 11 machine learning practitioners that investigates the needs surrounding semi-automated bias detection tools. Based on the findings, we highlight four design considerations to guide system designers who aim to create future tools for bias detection.

HC Jul 18, 2015
MetroViz: Visual Analysis of Public Transportation Data

Fan Du, Joshua Brulé, Peter Enns et al.

Understanding the quality and usage of public transportation resources is important for schedule optimization and resource allocation. Ridership and adherence are the two main dimensions for evaluating the quality of service. Using Automatic Vehicle Location (AVL), Automatic Passenger Count (APC), and Global Positioning System (GPS) data, ridership data and adherence data of public transportation can be collected. In this paper, we discuss the development of a visualization tool for exploring public transportation data. We introduce "map view" and "route view" to help users locate stops in the context of geography and route information. To visualize ridership and adherence information over several years, we introduce "calendar view" - a miniaturized calendar that provides an overview of data where users can interactively select specific days to explore individual trips and stops ("trip subview" and "stop subview"). MetroViz was evaluated via a series of usability tests that included researchers from the Center for Advanced Transportation Technology (CATT) and students from the University of Maryland - College Park in which test participants used the tool to explore three years of bus transit data from Blacksburg, Virginia.