14 Papers

CVJun 2
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong, Muzhi Zhu, Shenyan Zeng et al.

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

ARMay 8
Five-Minute Rule 40 Years Later: A First-Principles Revisit for Modern Memory Hierarchy

Tong Zhang, Vikram Sharma Mailthody, Fei Sun et al.

In 1987, Jim Gray and Gianfranco Putzolu introduced the five-minute rule, a simple, storage-memory-economics-based heuristic for deciding when data should live in DRAM rather than on storage. Subsequent revisits to the rule largely retained that economics-only view, leaving host costs, feasibility limits, and workload behavior out of scope. This paper revisits the rule from first principles, integrating host costs, DRAM bandwidth/capacity, and physics-grounded models of SSD performance and cost, and then embedding these elements in a constraint- and workload-aware framework that yields actionable provisioning guidance. We show that, for modern AI platforms, especially GPU-centric hosts paired with ultra-high-IOPS SSDs engineered for fine-grained random access, the DRAM↔flash caching threshold collapses from minutes to a few seconds. This shift reframes NAND flash memory as an active data tier and exposes a broad research space across the hardware-software stack. We further introduce MQSim-Next, a calibrated SSD simulator that supports validation and sensitivity analysis and facilitates future architectural and system research. Finally, we present two concrete case studies that showcase the software system design space opened by this memory-hierarchy paradigm shift. Overall, we turn a classical heuristic into an actionable, feasibility-aware analysis and provisioning framework and set the stage for further research on AI-era memory hierarchy.
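
For orientation, the economics-only view that the abstract says earlier revisits retained is commonly written as a break-even reference interval: (pages per MB of RAM / accesses per second per drive) × (price per drive / price per MB of RAM). The sketch below evaluates only that classic form; it is not the paper's extended framework, and the function name and input values are illustrative assumptions rather than figures from the paper.

```python
def break_even_interval_seconds(pages_per_mb_ram, accesses_per_sec_per_drive,
                                price_per_drive, price_per_mb_ram):
    """Classic break-even interval: keep a page in RAM if it is re-referenced
    more often than once per this many seconds."""
    # Technology ratio: how many pages one MB of RAM holds per unit of drive throughput.
    technology_ratio = pages_per_mb_ram / accesses_per_sec_per_drive
    # Economic ratio: cost of a drive relative to the cost of one MB of RAM.
    economic_ratio = price_per_drive / price_per_mb_ram
    return technology_ratio * economic_ratio


# Illustrative inputs only (assumed, not taken from the paper above):
# 8 KB pages (128 per MB), 64 accesses/s per drive, $2000 per drive, $15 per MB of RAM.
interval = break_even_interval_seconds(128, 64, 2000.0, 15.0)
print(f"break-even interval: {interval:.0f} s")
```

With these assumed inputs the interval is a few minutes; a cheaper or much faster storage device shrinks the first or second ratio and pulls the threshold down, which is the direction of the shift the abstract reports.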

CVDec 8, 2025
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Zekai Luo, Zongze Du, Zhouhang Zhu et al.

Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face-swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap

LGFeb 9
LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Tiwei Bie, Maosong Cao, Xiang Cao et al.

While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performance with a manageable efficiency degradation. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
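
The abstract does not spell out the decoding rule, so the following is only a guess at what a joint threshold scheme with two knobs could look like: masked positions are filled when confidence clears an M2T threshold, and already-decoded tokens are overwritten when a different candidate clears a separate T2T threshold. The function name, the `MASK` placeholder, and the confidence inputs are all hypothetical.

```python
MASK = None  # hypothetical placeholder for a position that is still masked


def threshold_decode_step(tokens, predictions, m2t_threshold, t2t_threshold):
    """One illustrative decoding step over a block.

    tokens: current block, with MASK at undecoded positions.
    predictions: one (candidate_token, confidence) pair per position.
    """
    updated = list(tokens)
    for i, (candidate, confidence) in enumerate(predictions):
        if tokens[i] is MASK:
            # Mask-to-Token: commit a masked position once confidence is high enough.
            if confidence >= m2t_threshold:
                updated[i] = candidate
        elif candidate != tokens[i] and confidence >= t2t_threshold:
            # Token-to-Token: revise a previously committed token.
            updated[i] = candidate
    return updated


# A low M2T threshold commits aggressively; T2T later repairs the early mistake "ad".
block = ["def", "ad", "(", MASK]
preds = [("def", 0.99), ("add", 0.95), ("(", 0.90), ("x", 0.70)]
print(threshold_decode_step(block, preds, m2t_threshold=0.5, t2t_threshold=0.9))
```

Under this reading, lowering `m2t_threshold` corresponds to the speed-oriented setting and raising both thresholds to the quality-oriented one.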

ROMar 8Code
PanoDP: Learning Collision-Free Navigation with Panoramic Depth and Differentiable Physics

Hao Zhong, Pei Chi, Jiang Zhao et al.

Autonomous collision-free navigation in cluttered environments requires safe decision-making under partial observability with both static structure and dynamic obstacles. We present PanoDP, a communication-free learning framework that combines four-view panoramic depth perception with differentiable-physics-based training signals. PanoDP encodes panoramic depth using a lightweight CNN and optimizes policies with dense differentiable collision and motion-feasibility terms, improving training stability beyond sparse terminal collisions. We evaluate PanoDP on a controlled ring-to-center benchmark with systematic sweeps over agent count, obstacle density/layout, and dynamic behaviors, and further test out-of-distribution generalization in an external simulator (e.g., AirSim). Across settings, PanoDP increases collision-free and completion rates over single-view and non-physics-guided baselines under matched training budgets, and ablations (view masking, rotation augmentation) confirm the policy leverages 360-degree information. Code will be open-sourced upon acceptance.

SEMar 19, 2021Code
Locating Faulty Methods with a Mixed RNN and Attention Model

Shouliang Yang, Junming Cao, Hushuang Zeng et al.

IR-based fault localization approaches achieve promising results when locating faulty files by comparing a bug report with source code. Unfortunately, they become less effective at locating faulty methods. We conduct a preliminary study to explore the challenges, and identify three problems: the semantic gap problem, the representation sparseness problem, and the single revision problem. To tackle these problems, we propose MRAM, a mixed RNN and attention model, which combines bug-fixing features and method structured features to explore both implicit and explicit relevance between methods and bug reports for the method-level fault localization task. The core ideas of our model are: (1) constructing code revision graphs from code, commits and past bug reports, which reveal the latent relations among methods to augment short methods and also provide all revisions of code and past fixes to train more accurate models; (2) embedding three method structured features (token sequences, API invocation sequences, and comments) jointly with RNN and soft attention to represent source methods and obtain their implicit relevance with bug reports; and (3) integrating multi-revision bug-fixing features, which provide the explicit relevance between bug reports and methods, to improve the performance. We have implemented MRAM and conducted a controlled experiment on five open-source projects. Compared with state-of-the-art approaches, MRAM improves MRR values by 3.8-5.1% (3.7-5.4%) when the dataset contains (does not contain) localized bug reports. Our statistical test shows that the improvements are significant.
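
The MRR figure reported above is mean reciprocal rank. A minimal sketch of the standard metric follows, assuming each bug report is summarized by the 1-based rank of the first truly faulty method in the returned list (or `None` when none is returned); the function name and sample ranks are illustrative and not from the paper.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Average of 1/rank over all queries; a query with no hit contributes 0."""
    if not first_hit_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


# Four hypothetical bug reports: first faulty method at ranks 1, 2, not found, and 4.
print(mean_reciprocal_rank([1, 2, None, 4]))
```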

SEFeb 15, 2021Code
Investigating and Recommending Co-Changed Entities for JavaScript Programs

Zijian Jiang, Hao Zhong, Na Meng

JavaScript (JS) is one of the most popular programming languages due to its flexibility and versatility, but maintaining JS code is tedious and error-prone. In our research, we conducted an empirical study to characterize the relationship between co-changed software entities (e.g., functions and variables), and built a machine learning (ML)-based approach to recommend additional entities to edit given developers' code changes. Specifically, we first crawled 14,747 commits in 10 open-source projects; for each commit, we created one or more change dependency graphs (CDGs) to model the referencer-referencee relationship between co-changed entities. Next, we extracted the common subgraphs between CDGs to locate recurring co-change patterns between entities. Finally, based on those patterns, we extracted code features from co-changed entities and trained an ML model that recommends entities-to-change given a program commit. According to our empirical investigation, (1) three recurring patterns commonly exist in all projects; (2) 80%-90% of co-changed function pairs either invoke the same function(s), access the same variable(s), or contain similar statement(s); (3) our ML-based approach CoRec recommended entity changes with high accuracy (73%-78%). CoRec complements prior work because it suggests changes based on program syntax, textual similarity, as well as software history; it achieved higher accuracy than two existing tools in our evaluation.

CVMay 27, 2025
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu, Hao Zhong, Canyu Zhao et al.

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
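
As background on the GRPO component named above, the step that distinguishes it from critic-based policy gradients is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization; using the population standard deviation and the `eps` term are assumptions, and nothing here reproduces ACTIVE-O3's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat their group's average receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.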

CVApr 22
Exploring Spatial Intelligence from a Generative Perspective

Muzhi Zhu, Shunyao Jiang, Huanyi Zheng et al.

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

CVApr 9
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Yiduo Jia, Muzhi Zhu, Hao Zhong et al.

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
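
A temporal reordering proxy task of the kind described above can be sketched as follows: shuffle the clips, keep the permutation as a free label, and score a predicted ordering against it. The position-wise accuracy reward and all names below are assumptions for illustration; the abstract does not specify the reward or the puzzle construction.

```python
import random


def make_puzzle(num_clips, seed):
    """Return shuffled[k] = chronological index of the clip shown at position k."""
    rng = random.Random(seed)
    shuffled = list(range(num_clips))
    rng.shuffle(shuffled)
    return shuffled


def solution(shuffled):
    """Shown positions listed in chronological order (the self-supervised label)."""
    return sorted(range(len(shuffled)), key=lambda k: shuffled[k])


def reordering_reward(predicted, target):
    """Assumed reward: fraction of slots where the predicted position is correct."""
    if len(predicted) != len(target):
        return 0.0
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


shuffled = [2, 0, 1]          # position 0 shows the last clip, position 1 the first
target = solution(shuffled)   # chronological reading order of the shown positions
print(target, reordering_reward([1, 0, 2], target))
```

Because the label comes from the shuffle itself, no human annotation is required, which is what makes the reward verifiable on unannotated data.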

ROOct 4, 2025
NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation

Zheng Huang, Mingyu Liu, Xiaoyi Lin et al.

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages and better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.

AISep 28, 2025
GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks

Cong Chen, Kaixiang Ji, Hao Zhong et al.

Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse large-scale dataset of 52k interactions that features human-annotated scores and GPT-4o generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. As far as we know, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by 7.7 points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. When used as an inference verifier, it brings a 5.1-point improvement. The benefits generalize to the offline AndroidControl benchmark, with gains of 2.2 points as a reward provider and 4.3 points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.

SEMay 6, 2021
Migrating Client Code without Change Examples

Hao Zhong, Na Meng

API developers evolve software libraries to fix bugs, add new features, or refactor code. To benefit from such library evolution, the programmers of client projects have to repetitively upgrade their library usages and adapt their codebases to any library API breaking changes (e.g., API renaming). Such adaptive changes can be tedious and error-prone. Existing tools provide limited support to help programmers migrate client projects from old library versions to new ones. For instance, some tools extract API mappings between library versions and only suggest simple adaptive changes (i.e., statement updates); other tools suggest or automate more complicated edits (e.g., statement insertions) based on user-provided exemplar code migrations. However, when new library versions are available, it is usually cumbersome and time-consuming for users to provide sufficient human-crafted samples in order to guide automatic migration. In this paper, we propose a novel approach, AutoUpdate, to further improve the state of the art. Instead of learning from change examples, we designed AutoUpdate to automate migration in a compiler-directed way. Namely, given a compilation error triggered by upgrading libraries, AutoUpdate exploits 13 migration operators to generate candidate edits, and tentatively applies each edit until the error is resolved or all edits are explored. We conducted two experiments. The first experiment involves migrating 371 tutorial examples between versions of 5 popular libraries. AutoUpdate reduced migration-related compilation errors for 92.7% of tasks. It eliminated such errors for 32.4% of tasks, and 33.9% of the tasks have identical edits to manual migrations. In the second experiment, we applied AutoUpdate to migrate two real client projects of lucene. AutoUpdate successfully migrated both projects, and the migrated code passed all tests.
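
The compiler-directed loop described above (generate candidate edits, apply each tentatively, stop when the error is resolved or the candidates run out) can be sketched in a few lines. This is a toy under stated assumptions: the compile check, the two operators, and every name are hypothetical stand-ins, not AutoUpdate's 13 operators or its real interface.

```python
def migrate(code, compile_fn, operators):
    """Try each operator's candidate edit until the compilation error disappears.

    compile_fn returns an error message, or None when the code compiles.
    Returns the migrated code, or None if every candidate was explored without success.
    """
    error = compile_fn(code)
    if error is None:
        return code  # nothing to migrate
    for operator in operators:
        candidate = operator(code, error)
        if candidate is None:
            continue  # operator does not apply to this error
        if compile_fn(candidate) is None:
            return candidate  # tentative edit resolved the error, so keep it
    return None


# Toy stand-ins (hypothetical): the "new library" no longer provides oldName().
def toy_compile(code):
    return "unknown symbol: oldName" if "oldName(" in code else None


def insert_statement(code, error):
    return None  # not applicable in this toy example


def rename_call(code, error):
    return code.replace("oldName(", "newName(")


print(migrate("x = oldName(1)", toy_compile, [insert_statement, rename_call]))
```

Using the compiler as the oracle is what removes the need for user-provided change examples: an edit is kept only if it makes the reported error go away.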

SEFeb 13, 2019
Vulnerability Prediction Based on Weighted Software Network for Secure Software Building

Shengjun Wei, Hao Zhong, Chun Shan et al.

To build secure communications software, Vulnerability Prediction Models (VPMs) are used to predict vulnerable software modules in the software system before software security testing. At present, many software security metrics have been proposed to design a VPM. In this paper, we predict vulnerable classes in a software system by establishing the system's weighted software network. The metrics are obtained from the nodes' attributes in the weighted software network. We design and implement a crawler tool to collect all public security vulnerabilities in Mozilla Firefox. Based on these data, the prediction model is trained and tested. The results show that the VPM based on the weighted software network performs well in accuracy, precision, and recall. Compared with other studies, prediction performance is greatly improved in precision (Pr) and recall (Re).