ASSep 20, 2024Code
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing TasksYu Zhang, Changhao Pan, Wenxiang Guo et al.
The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The demos can be found at http://aaronz345.github.io/GTSingerDemo/. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/AaronZ345/GTSinger and https://github.com/AaronZ345/GTSinger.
ROJun 3
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait AdaptationKailun Huang, Zikang Xie, Yanzhe Xie et al.
Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Similarly, humanoid robots should achieve smooth transitions between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference and the distribution shift induced by terrain-dependent visual and dynamic variations. Although Mixture-of-Experts (MoE) architectures can alleviate multi-skill interference, naive joint training often fails to yield clear expert specialization, limiting their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective to shape the gating network, enabling it to capture structured terrain representations and promote expert specialization. The final action is obtained via weighted fusion of the base gait policy and the terrain-aware branch, allowing the policy to preserve stable locomotion patterns while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains, while maintaining accurate foothold placement and dynamic stability under external disturbances.
CVMar 15, 2023Code
Mining False Positive Examples for Text-Based Person Re-identificationWenhao Xu, Zhiyin Shao, Changxing Ding
Text-based person re-identification (ReID) aims to identify images of the targeted person from a large-scale person image database according to a given textual description. However, due to significant inter-modal gaps, text-based person ReID remains a challenging problem. Most existing methods generally rely heavily on the similarity contributed by matched word-region pairs, while neglecting mismatched word-region pairs which may play a decisive role. Accordingly, we propose to mine false positive examples (MFPE) via a jointly optimized multi-branch architecture to handle this problem. MFPE contains three branches including a false positive mining (FPM) branch to highlight the role of mismatched word-region pairs. Besides, MFPE delicately designs a cross-relu loss to increase the gap of similarity scores between matched and mismatched word-region pairs. Extensive experiments on CUHK-PEDES demonstrate the superior effectiveness of MFPE. Our code is released at https://github.com/xx-adeline/MFPE.
CVJul 5, 2024Code
MobileFlow: A Multimodal LLM For Mobile GUI AgentSongqin Nong, Jiali Zhu, Rui Wu et al.
Currently, the integration of mobile Graphical User Interfaces (GUIs) is ubiquitous in most people's daily lives. And the ongoing evolution of multimodal large-scale models, such as GPT-4v, Qwen-VL-Max, has significantly bolstered the capabilities of GUI comprehension and user action analysis, showcasing the potentiality of intelligent GUI assistants. However, current GUI Agents often need to access page layout information through calling system APIs, which may pose privacy risks. Fixing GUI (such as mobile interfaces) to a certain low resolution might result in the loss of fine-grained image details. At the same time, the multimodal large models built for GUI Agents currently have poor understanding and decision-making abilities for Chinese GUI interfaces, making them difficult to apply to a large number of Chinese apps. This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents. Transforming from the open-source model Qwen-VL-Chat into GUI domain, MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders, making it possible for variable resolutions of image inputs and good support for multilingual GUI. By incorporating Mixture of Experts (MoE) expansions and pioneering alignment training strategies, MobileFlow has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks. Finally, MobileFlow outperforms Qwen-VL-Max and GPT-4v in terms of task execution by GUI agents on both public and our proposed evaluation metrics, and has been successfully deployed in real-world business contexts, proving its effectiveness for practical applications.
CVSep 13, 2024
PSTNet: Enhanced Polyp Segmentation with Multi-scale Alignment and Frequency Domain IntegrationWenhao Xu, Rongtao Xu, Changwei Wang et al.
Accurate segmentation of colorectal polyps in colonoscopy images is crucial for effective diagnosis and management of colorectal cancer (CRC). However, current deep learning-based methods primarily rely on fusing RGB information across multiple scales, leading to limitations in accurately identifying polyps due to restricted RGB domain information and challenges in feature misalignment during multi-scale aggregation. To address these limitations, we propose the Polyp Segmentation Network with Shunted Transformer (PSTNet), a novel approach that integrates both RGB and frequency domain cues present in the images. PSTNet comprises three key modules: the Frequency Characterization Attention Module (FCAM) for extracting frequency cues and capturing polyp characteristics, the Feature Supplementary Alignment Module (FSAM) for aligning semantic information and reducing misalignment noise, and the Cross Perception localization Module (CPM) for synergizing frequency cues with high-level semantics to achieve efficient polyp segmentation. Extensive experiments on challenging datasets demonstrate PSTNet's significant improvement in polyp segmentation accuracy across various metrics, consistently outperforming state-of-the-art methods. The integration of frequency domain cues and the novel architectural design of PSTNet contribute to advancing computer-assisted polyp segmentation, facilitating more accurate diagnosis and management of CRC.
ROMay 25
ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour LocomotionYanheng Mai, Wenhao Xu, Zirui Huang et al.
Humanoid parkour requires locomotion policies to coordinate whole-body dynamics across rapidly changing terrains such as stairs, gaps, slopes, and obstacles. Existing reinforcement learning policies are largely reactive, mapping observations directly to actions without explicitly modeling future body states. Such modeling becomes critical in agile locomotion tasks where successful motion execution depends strongly on anticipating upcoming contact transitions and body dynamics.We present ParkourFormer, a Transformer-based sequence modeling framework that reformulates humanoid locomotion as a future-conditioned decision-making problem. The current robot state queries historical sensorimotor trajectories through cross-attention, while a lightweight prediction head forecasts short-horizon future proprioceptive states. The predicted future states, trained with supervised signals, are fused with temporal features to generate actions, enabling the policy to jointly reason over motion history and anticipated future dynamics. We evaluate ParkourFormer on a diverse multi-terrain humanoid parkour benchmark including stairs, gaps, slopes, rough terrain, and obstacle traversal. Experiments in simulation and on a real humanoid robot show that ParkourFormer achieves a 93.85% average traversal success rate on highly challenging terrains, with improvements of up to 42.73% over strong MLP, MoE-based MLP, and vanilla Transformer baselines, while maintaining a single unified policy across all terrain types. These results demonstrate that explicit future-state modeling significantly improves robustness and generalization for agile whole-body locomotion.
CVSep 13, 2024
Generalization Boosted Adapter for Open-Vocabulary SegmentationWenhao Xu, Changwei Wang, Xuxiang Feng et al.
Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency ``noise'' information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.
CVJul 9, 2024
CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based UnderstandingWenhao Xu, Wenming Weng, Yueyi Zhang et al.
We present CEIA, an effective framework for open-world event-based understanding. Currently training a large event-text model still poses a huge challenge due to the shortage of paired event-text data. In response to this challenge, CEIA learns to align event and image data as an alternative instead of directly aligning event and text data. Specifically, we leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP through contrastive learning. In this way, event and text data are naturally aligned via using image data as a bridge. Particularly, CEIA offers two distinct advantages. First, it allows us to take full advantage of the existing event-image datasets to make up the shortage of large-scale event-text datasets. Second, leveraging more training data, it also exhibits the flexibility to boost performance, ensuring scalable capability. In highlighting the versatility of our framework, we make extensive evaluations through a diverse range of event-based multi-modal applications, such as object recognition, event-image retrieval, event-text retrieval, and domain adaptation. The outcomes demonstrate CEIA's distinct zero-shot superiority over existing methods on these applications.
LGJan 30, 2023
Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty EquivalentsWenhao Xu, Xuefeng Gao, Xuedong He
The optimized certainty equivalent (OCE) is a family of risk measures that cover important examples such as entropic risk, conditional value-at-risk and mean-variance models. In this paper, we propose a new episodic risk-sensitive reinforcement learning formulation based on tabular Markov decision processes with recursive OCEs. We design an efficient learning algorithm for this problem based on value iteration and upper confidence bound. We derive an upper bound on the regret of the proposed algorithm, and also establish a minimax lower bound. Our bounds show that the regret rate achieved by our proposed algorithm has optimal dependence on the number of episodes and the number of actions.
CVMar 16, 2024Code
HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object DetectionShibiao Xu, ShuChen Zheng, Wenhao Xu et al.
Infrared small object detection is an important computer vision task involving the recognition and localization of tiny objects in infrared images, which usually contain only a few pixels. However, it encounters difficulties due to the diminutive size of the objects and the generally complex backgrounds in infrared images. In this paper, we propose a deep learning method, HCF-Net, that significantly improves infrared small object detection performance through multiple practical modules. Specifically, it includes the parallelized patch-aware attention (PPA) module, dimension-aware selective integration (DASI) module, and multi-dilated channel refiner (MDCR) module. The PPA module uses a multi-branch feature extraction strategy to capture feature information at different scales and levels. The DASI module enables adaptive channel selection and fusion. The MDCR module captures spatial features of different receptive field ranges through multiple depth-separable convolutional layers. Extensive experimental results on the SIRST infrared single-frame image dataset show that the proposed HCF-Net performs well, surpassing other traditional and deep learning models. Code is available at https://github.com/zhengshuchen/HCFNet.
CVMar 3
Modeling Cross-vision Synergy for Unified Large Vision ModelShengqiong Wu, Lanhu Wu, Mingyang Bao et al.
Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: https://sqwu.top/PolyV.
CVApr 15
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI GroundingFei Tang, Bofan Chen, Zhengxi Lu et al.
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose \textbf{UI-Zoomer}, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4\%, +10.3\%, and +4.2\% respectively, with no additional training required.
CVDec 20, 2023Code
Spectral Prompt Tuning:Unveiling Unseen Classes for Zero-Shot Semantic SegmentationWenhao Xu, Rongtao Xu, Changwei Wang et al.
Recently, CLIP has found practical utility in the domain of pixel-level zero-shot segmentation tasks. The present landscape features two-stage methodologies beset by issues such as intricate pipelines and elevated computational costs. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's generalization capacity, they still fall short in fully harnessing CLIP's potential for pixel-level unseen class demarcation and precise pixel predictions. To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel. Specifically, we initially introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers to capture structural intricacies of images, thereby enhancing comprehension of unseen classes. Subsequently, we introduce the Spectral Guided Decoder (SGD), utilizing both high and low-frequency information to steer the network's spatial focus towards more prominent classification features, enabling precise pixel-level prediction outcomes. Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes. Code is available at:https://github.com/clearxu/SPT.
CVMay 29, 2025Code
SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target DetectionWenhao Xu, Shuchen Zheng, Changwei Wang et al.
Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
CVMay 6, 2025Code
Image Recognition with Online Lightweight Vision Transformer: A SurveyZherui Zhang, Rongtao Xu, Jie Zhou et al.
The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
CVSep 30, 2025Code
SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place RecognitionShunpeng Chen, Changwei Wang, Rongtao Xu et al.
Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, organize samples during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED only using 4096D global descriptors. Code and model will be available at: https://github.com/chenshunpeng/SAGE.
CYNov 11, 2025
Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement MonitoringWenhao Xu, Akshatha Arodi, Jian-Yun Nie et al.
Modern slavery affects millions of people worldwide, and regulatory frameworks such as Modern Slavery Acts now require companies to publish detailed disclosures. However, these statements are often vague and inconsistent, making manual review time-consuming and difficult to scale. While NLP offers a promising path forward, high-stakes compliance tasks require more than accurate classification: they demand transparent, rule-aligned outputs that legal experts can verify. Existing applications of large language models (LLMs) often reduce complex regulatory assessments to binary decisions, lacking the necessary structure for robust legal scrutiny. We argue that compliance verification is fundamentally a rule-matching problem: it requires evaluating whether textual statements adhere to well-defined regulatory rules. To this end, we propose a novel framework that harnesses AI for rule-level compliance verification while preserving expert oversight. At its core is the Compliance Alignment Judge (CA-Judge), which evaluates model-generated justifications based on their fidelity to statutory requirements. Using this feedback, we train the Compliance Alignment LLM (CALLM), a model that produces rule-consistent, human-verifiable outputs. CALLM improves predictive performance and generates outputs that are both transparent and legally grounded, offering a more verifiable and actionable solution for real-world compliance analysis.
CVMar 12
DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code GenerationXinhao Huang, Jinke Yu, Wenhao Xu et al.
While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck-failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on the HiFi2Code demonstrate that DOne outperforms exiting methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a 3 times productivity gain with higher visual fidelity.
CLMar 28, 2024
MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text EvaluationYu Li, Shenyu Zhang, Rui Wu et al.
Recent advancements in generative Large Language Models(LLMs) have been remarkable, however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models, especially in open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with significant uncertainty and instability. To address these issues, we propose the MATEval: A "Multi-Agent Text Evaluation framework" where all agents are played by LLMs like GPT-4. The MATEval framework emulates human collaborative discussion methods, integrating multiple agents' interactions to evaluate open-ended text. Our framework incorporates self-reflection and Chain-of-Thought (CoT) strategies, along with feedback mechanisms, enhancing the depth and breadth of the evaluation process and guiding discussions towards consensus, while the framework generates comprehensive evaluation reports, including error localization, error types and scoring. Experimental results show that our framework outperforms existing open-ended text evaluation methods and achieves the highest correlation with human evaluation, which confirms the effectiveness and advancement of our framework in addressing the uncertainties and instabilities in evaluating LLMs-generated text. Furthermore, our framework significantly improves the efficiency of text evaluation and model iteration in industrial scenarios.
CVApr 30, 2025
CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge DistillationZherui Zhang, Changwei Wang, Rongtao Xu et al.
Data-Free Knowledge Distillation (DFKD) enables the knowledge transfer from the given pre-trained teacher network to the target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses at the embedding level the limitations of previous rely on image-level methods to improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: \textit{\textbf{i.)}} Significant efficiency advantages resulting from altering the generator training paradigm; \textit{\textbf{ii.)}} Competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; \textit{\textbf{iii.)}} Remarkable transferability of data-free learned representations demonstrated in downstream tasks.
CVOct 14, 2025
CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene CompletionJinzhou Lin, Jie Zhou, Wenhao Xu et al.
Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
CLMar 18, 2024
DEE: Dual-stage Explainable Evaluation Method for Text GenerationShenyu Zhang, Yu Li, Rui Wu et al.
Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to grapple with a lack of explainability, issuing a solitary numerical score to signify the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts in the initial stage and subsequently delves into providing comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm that DEE's superiority over existing evaluation methods, achieving significant improvements in both human correlation as well as efficiency.
CVNov 24, 2025
EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language ModelsWenhao Xu, Xin Dong, Yue Li et al.
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
LGSep 23, 2025
GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility UnderstandingWenying Luo, Zhiyuan Lin, Wenhao Xu et al.
Human mobility traces, often recorded as sequences of check-ins, provide a unique window into both short-term visiting patterns and persistent lifestyle regularities. In this work we introduce GSTM-HMU, a generative spatio-temporal framework designed to advance mobility analysis by explicitly modeling the semantic and temporal complexity of human movement. The framework consists of four key innovations. First, a Spatio-Temporal Concept Encoder (STCE) integrates geographic location, POI category semantics, and periodic temporal rhythms into unified vector representations. Second, a Cognitive Trajectory Memory (CTM) adaptively filters historical visits, emphasizing recent and behaviorally salient events in order to capture user intent more effectively. Third, a Lifestyle Concept Bank (LCB) contributes structured human preference cues, such as activity types and lifestyle patterns, to enhance interpretability and personalization. Finally, task-oriented generative heads transform the learned representations into predictions for multiple downstream tasks. We conduct extensive experiments on four widely used real-world datasets, including Gowalla, WeePlace, Brightkite, and FourSquare, and evaluate performance on three benchmark tasks: next-location prediction, trajectory-user identification, and time estimation. The results demonstrate consistent and substantial improvements over strong baselines, confirming the effectiveness of GSTM-HMU in extracting semantic regularities from complex mobility data. Beyond raw performance gains, our findings also suggest that generative modeling provides a promising foundation for building more robust, interpretable, and generalizable systems for human mobility intelligence.
AIAug 15, 2025
CRAFT-GUI: Curriculum-Reinforced Agent For GUI TasksSongqin Nong, Jingxuan Xu, Sheng Zhou et al.
As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents' performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent's ability to adapt its learning process; and (2) most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine-grained policy optimization, we design a reward function that combines simple rule-based signals with model-judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state-of-the-art approaches, outperforming them by 5.6% on public benchmarks Android Control and 10.3% on our internal online benchmarks, respectively. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks.
CVMay 23, 2025
FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models AdaptationZherui Zhang, Jiaxin Wu, Changwei Wang et al.
Prompt learning as a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive teacher model online inference sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose {\large {\textbf{F}}}aster {\large {\textbf{D}}}istillation-{\large {\textbf{B}}}ased {\large {\textbf{P}}}rompt {\large {\textbf{L}}}earning (\textbf{FDBPL}), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that containing multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2\times$ faster training speed.
SPApr 29, 2025
Generalised Label-free Artefact Cleaning for Real-time Medical Pulsatile Time SeriesXuhang Chen, Ihsane Olakorede, Stefan Yu Bögli et al.
Artefacts compromise clinical decision-making in the use of medical time series. Pulsatile waveforms offer probabilities for accurate artefact detection, yet most approaches rely on supervised manners and overlook patient-level distribution shifts. To address these issues, we introduce a generalised label-free framework, GenClean, for real-time artefact cleaning and leverage an in-house dataset of 180,000 ten-second arterial blood pressure (ABP) samples for training. We first investigate patient-level generalisation, demonstrating robust performances under both intra- and inter-patient distribution shifts. We further validate its effectiveness through challenging cross-disease cohort experiments on the MIMIC-III database. Additionally, we extend our method to photoplethysmography (PPG), highlighting its applicability to diverse medical pulsatile signals. Finally, its integration into ICM+, a clinical research monitoring software, confirms the real-time feasibility of our framework, emphasising its practical utility in continuous physiological monitoring. This work provides a foundational step toward precision medicine in improving the reliability of high-resolution medical time series analysis
CVMar 11, 2025
S2A: A Unified Framework for Parameter and Memory Efficient Transfer LearningTian Jin, Enjun Du, Changwei Wang et al.
Parameter-efficient transfer learning (PETL) aims to reduce the scales of pretrained models for multiple downstream tasks. However, as the models keep scaling up, the memory footprint of existing PETL methods is not significantly reduced compared to the reduction of learnable parameters. This limitation hinders the practical deployment of PETL methods on memory-constrained devices. To this end, we proposed a new PETL framework, called Structure to Activation (S2A), to reduce the memory footprint of activation during fine-tuning. Specifically, our framework consists of: 1) Activation modules design(i.e., bias, prompt and side modules) in the parametric model structure, which results in a significant reduction of adjustable parameters and activation memory; 2) 4-bit quantization of activations based on their derivatives for non-parametric structures (e.g., nonlinear functions), which maintains accuracy while significantly reducing memory usage. Our S2A method consequently offers a lightweight solution in terms of both parameters and memory footprint. We evaluated S2A with different backbones and performed extensive experiments on various datasets to evaluate the effectiveness. The results show that our methods not only outperform existing PETL techniques, achieving a fourfold reduction in GPU memory footprint on average, but also shows competitive performance in accuracy with fewer tunable parameters. These demonstrate that our method is highly suitable for practical transfer learning on hardware-constrained devices.
CVNov 25, 2024
Event-boosted Deformable 3D Gaussians for Dynamic Scene ReconstructionWenhao Xu, Wenming Weng, Yueyi Zhang et al.
Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
LGJun 8, 2024
Regret Bounds for Episodic Risk-Sensitive Linear Quadratic RegulatorWenhao Xu, Xuefeng Gao, Xuedong He
Risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control. In this paper, we study online adaptive control of risk-sensitive linear quadratic regulator in the finite horizon episodic setting. We propose a simple least-squares greedy algorithm and show that it achieves $\widetilde{\mathcal{O}}(\log N)$ regret under a specific identifiability assumption, where $N$ is the total number of episodes. If the identifiability assumption is not satisfied, we propose incorporating exploration noise into the least-squares-based algorithm, resulting in an algorithm with $\widetilde{\mathcal{O}}(\sqrt{N})$ regret. To our best knowledge, this is the first set of regret bounds for episodic risk-sensitive linear quadratic regulator. Our proof relies on perturbation analysis of less-standard Riccati equations for risk-sensitive linear quadratic control, and a delicate analysis of the loss in the risk-sensitive performance criterion due to applying the suboptimal controller in the online learning process.
CVDec 15, 2019
Domain Agnostic Feature Learning for Image and Video Based Face Anti-spoofingSuman Saha, Wenhao Xu, Menelaos Kanakis et al.
Nowadays, the increasingly growing number of mobile and computing devices has led to a demand for safer user authentication systems. Face anti-spoofing is a measure towards this direction for bio-metric user authentication, and in particular face recognition, that tries to prevent spoof attacks. The state-of-the-art anti-spoofing techniques leverage the ability of deep neural networks to learn discriminative features, based on cues from the training set images or video samples, in an effort to detect spoof attacks. However, due to the particular nature of the problem, i.e. large variability due to factors like different backgrounds, lighting conditions, camera resolutions, spoof materials, etc., these techniques typically fail to generalize to new samples. In this paper, we explicitly tackle this problem and propose a class-conditional domain discriminator module, that, coupled with a gradient reversal layer, tries to generate live and spoof features that are discriminative, but at the same time robust against the aforementioned variability factors. Extensive experimental analysis shows the effectiveness of the proposed method over existing image- and video-based anti-spoofing techniques, both in terms of numerical improvement as well as when visualizing the learned features.
CVApr 30, 2019
Structured Prediction using cGANs with Fusion DiscriminatorFaisal Mahmood, Wenhao Xu, Nicholas J. Durr et al.
We propose the fusion discriminator, a single unified framework for incorporating conditional information into a generative adversarial network (GAN) for a variety of distinct structured prediction tasks, including image synthesis, semantic segmentation, and depth estimation. Much like commonly used convolutional neural network -- conditional Markov random field (CNN-CRF) models, the proposed method is able to enforce higher-order consistency in the model, but without being limited to a very specific class of potentials. The method is conceptually simple and flexible, and our experimental results demonstrate improvement on several diverse structured prediction tasks.