Xiangyu Zheng

CV
h-index: 98
Papers: 10
Citations: 117
Novelty: 54%
AI Score: 45

10 Papers

CV · Mar 13, 2024 · Code
OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Wanyun Li, Pinxue Guo, Xinyu Zhou et al.

Contemporary Video Object Segmentation (VOS) approaches typically consist of stages for feature extraction, matching, memory management, and multi-object aggregation. Recent advanced models either model these components discretely and sequentially, or optimize a combined pipeline through substructure aggregation. However, these explicitly staged approaches prevent the VOS framework from being optimized as a unified whole, leading to limited capacity and suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all of the aforementioned modules into a vision transformer, we model all the features of frames, masks and memory for multiple objects as transformer tokens, and jointly accomplish feature extraction, matching and memory management of multiple objects through the flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in the OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, which achieves state-of-the-art performance across 7 datasets and particularly excels on the complex LVOS and MOSE datasets with 70.1% and 66.4% $J \& F$ scores, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Our code will be made available for reproducibility and further research.
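The $J \& F$ score quoted above is the mean of region similarity $J$ (the Jaccard index between predicted and ground-truth masks) and contour accuracy $F$. A minimal sketch of the $J$ term and the final average follows. The function names and toy masks are illustrative assumptions rather than part of OneVOS, and $F$ is taken as a given input because computing it requires contour matching.

```python
def region_similarity(pred, gt):
    """Jaccard index (J) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j_score + f_score) / 2.0


# Toy masks (hypothetical): 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
```

Benchmarks typically average these per-frame, per-object scores over a whole dataset before reporting a single number.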

CV · Mar 2
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

Yiweng Xie, Bo He, Junke Wang et al.

This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
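The idea of removing redundant visual tokens across adjacent frames can be illustrated with a simple similarity filter. The sketch below is not FluxMem's TAS module. It assumes tokens are aligned by position across frames, uses cosine similarity as the redundancy measure, and applies a fixed, hand-picked `threshold`, whereas the abstract describes a compression rate derived from scene statistics.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def prune_redundant_tokens(prev_frame, cur_frame, threshold=0.95):
    """Return indices of current-frame tokens that differ enough from the
    token at the same position in the previous frame to be worth keeping."""
    keep = []
    for i, (prev_tok, cur_tok) in enumerate(zip(prev_frame, cur_frame)):
        if cosine(prev_tok, cur_tok) < threshold:
            keep.append(i)
    return keep


# Toy 2-D tokens (hypothetical): only the middle token changes between frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur_frame = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
kept = prune_redundant_tokens(prev_frame, cur_frame)
```

Here only index 1 survives, so the memory stores one token for the new frame instead of three.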

RO · Dec 5, 2025 · Code
HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Zhiying Du, Bei Liu, Yaobo Liang et al.

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle heterogeneous robotic data. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module, which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.

CV · Apr 17, 2025
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li, Yeying Jin, Xin Jin et al.

This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. The challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, and includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. The dataset is divided into three subsets for the competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. The competition had a total of 361 participants, and 32 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

MTRL-SCI · May 13, 2025
Self-Optimizing Machine Learning Potential Assisted Automated Workflow for Highly Efficient Complex Systems Material Design

Jiaxiang Li, Junwei Feng, Jie Luo et al.

Machine learning interatomic potentials have revolutionized complex materials design by enabling rapid exploration of material configurational spaces via crystal structure prediction with ab initio accuracy. However, critical challenges persist in ensuring robust generalization to unknown structures and minimizing the requirement for substantial expert knowledge and time-consuming manual interventions. Here, we propose an automated crystal structure prediction framework built upon the attention-coupled neural networks potential to address these limitations. The generalizability of the potential is achieved by sampling regions across the local minima of the potential energy surface, where the self-evolving pipeline autonomously refines the potential iteratively while minimizing human intervention. The workflow is validated on Mg-Ca-H ternary and Be-P-N-O quaternary systems by exploring nearly 10 million configurations, demonstrating substantial speedup compared to first-principles calculations. These results underscore the effectiveness of our approach in accelerating the exploration and discovery of complex multi-component functional materials.

CV · Apr 8, 2025
Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation

Xiangyu Zheng, Wanyun Li, Songcheng He et al.

Recent mainstream motion-appearance approaches to unsupervised video object segmentation (UVOS) use either a bi-encoder structure to separately encode motion and appearance features, or a uni-encoder structure for joint encoding. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the suboptimal extracted features degrade the models' overall performance. Moreover, the quality of optical flow varies across scenarios, making it insufficient to rely solely on optical flow to achieve high-quality segmentation results. To address these challenges, we propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), which better balances the motion-appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, because optical flow maps are derived from RGB images, the two modalities share both commonalities and differences. Accordingly, we propose a novel Trunk-Collateral structure for motion-appearance UVOS. The shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model's intrinsic saliency information to refine high-level features and to provide pixel-level guidance for motion-appearance fusion, thereby enhancing performance without additional input. Experimental results show that SMTC-Net achieves state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and on four standard video salient object detection (VSOD) benchmarks with notable gains, demonstrating its effectiveness and superiority over previous methods.

CL · Feb 20, 2025
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P Team, Xinrun Du, Yifan Yao et al.

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

LG · Dec 5, 2023
UTBoost: Gradient Boosted Decision Trees for Uplift Modeling

Junjie Gao, Xiangyu Zheng, DongDong Wang et al.

Uplift modeling comprises a collection of machine learning techniques designed for managers to predict the incremental impact of specific actions on customer outcomes. However, accurately estimating this incremental impact poses significant challenges due to the necessity of determining the difference between two mutually exclusive outcomes for each individual. In our study, we introduce two novel modifications to the established Gradient Boosting Decision Trees (GBDT) technique. These modifications sequentially learn the causal effect, addressing the counterfactual dilemma. Each modification innovates upon the existing technique in terms of the ensemble learning method and the learning objective, respectively. Experiments with large-scale datasets validate the effectiveness of our methods, consistently achieving substantial improvements over baseline models.
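The "incremental impact" that uplift models target can be made concrete with the simplest possible estimator: the difference in mean outcome between treated and control groups. The sketch below is not UTBoost. It assumes treatment was randomly assigned, and the function name and toy records are hypothetical.

```python
def naive_uplift(records):
    """Estimate average uplift from a list of (treated, outcome) pairs as
    mean(outcome | treated) - mean(outcome | control)."""
    treated = [y for t, y in records if t]
    control = [y for t, y in records if not t]
    if not treated or not control:
        raise ValueError("need at least one treated and one control record")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data (hypothetical): 3 of 4 treated customers convert, 1 of 4 controls do.
records = [(True, 1), (True, 0), (True, 1), (True, 1),
           (False, 0), (False, 1), (False, 0), (False, 0)]
uplift = naive_uplift(records)
```

Each customer appears in only one group, so the per-individual difference is never observed directly. This is the counterfactual difficulty the abstract refers to, and it is why tree-based uplift methods estimate such differences within learned segments rather than per individual.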

ML · Jul 5, 2021
Which Invariance Should We Transfer? A Causal Minimax Learning Approach

Mingzhou Liu, Xiangyu Zheng, Xinwei Sun et al.

A major barrier to deploying current machine learning models lies in their unreliability under dataset shifts. To resolve this problem, most existing studies attempt to transfer stable information to unseen environments. In particular, methods based on independent causal mechanisms propose removing mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find, with an example, that the whole stable set, although it can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset in this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our search strategy enjoys polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.
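For reference, the exhaustive baseline mentioned in the abstract, which evaluates every subset and keeps the one with the smallest worst-case risk, can be sketched as below. This is the exponential-cost search, not the paper's polynomial algorithm. The `worst_case_risk` callable and the toy risk table are hypothetical stand-ins for the paper's estimator.

```python
from itertools import combinations


def minimax_subset(features, worst_case_risk):
    """Brute-force search over all 2^n subsets for the one with the
    smallest worst-case risk; returns (subset, risk)."""
    best_subset, best_risk = None, float("inf")
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            candidate = frozenset(subset)
            risk = worst_case_risk(candidate)
            if risk < best_risk:
                best_subset, best_risk = candidate, risk
    return best_subset, best_risk


# Hypothetical worst-case risks: the full set {"a", "b"} is not the minimizer.
toy_risk = {
    frozenset(): 1.0,
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.8,
    frozenset({"a", "b"}): 0.7,
}
subset, risk = minimax_subset(["a", "b"], toy_risk.__getitem__)
```

The toy table mirrors the abstract's observation that the whole stable set need not be the best one to transfer. With n candidate variables the loop visits 2^n subsets, which is the cost the paper's equivalence-based search is designed to avoid.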

LG · Nov 4, 2020
Latent Causal Invariant Model

Xinwei Sun, Botong Wu, Xiangyu Zheng et al.

Current supervised learning can learn spurious correlations during the data-fitting process, raising issues regarding interpretability, out-of-distribution (OOD) generalization, and robustness. To avoid spurious correlations, we propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction. Specifically, to model the underlying causal factors, we introduce latent variables that are separated into (a) output-causative factors and (b) others that are spuriously correlated with the output via confounders. We further assume the generating mechanisms from latent space to observed data to be causally invariant. We establish the identifiability of such invariance, particularly the disentanglement of output-causative factors from the others, as a theoretical guarantee for precise inference and for avoiding spurious correlations. We propose a Variational-Bayesian-based method for estimation and optimize over the latent space for prediction. The utility of our approach is verified by improved interpretability, prediction power on various OOD scenarios (including healthcare), and robustness on security.