Zhuo Wang

CV
h-index16
34papers
2,096citations
Novelty53%
AI Score62

34 Papers

CVMar 10, 2022Code
Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing

Zhuo Wang, Zezheng Wang, Zitong Yu et al.

With diverse presentation attacks emerging continually, generalizable face anti-spoofing (FAS) has drawn growing attention. Most existing methods implement domain generalization (DG) on the complete representations. However, different image statistics may have unique properties for the FAS tasks. In this work, we separate the complete representation into content and style ones. A novel Shuffled Style Assembly Network (SSAN) is proposed to extract and reassemble different content and style features for a stylized feature space. Then, to obtain a generalized representation, a contrastive learning strategy is developed to emphasize liveness-related style information while suppress the domain-specific one. Finally, the representations of the correct assemblies are used to distinguish between living and spoofing during the inferring. On the other hand, despite the decent performance, there still exists a gap between academia and industry, due to the difference in data quantity and distribution. Thus, a new large-scale benchmark for FAS is built up to further evaluate the performance of algorithms in reality. Both qualitative and quantitative results on existing and proposed benchmarks demonstrate the effectiveness of our methods. The codes will be available at https://github.com/wangzhuo2019/SSAN.

CVAug 8, 2022Code
Depth Quality-Inspired Feature Manipulation for Efficient RGB-D and Video Salient Object Detection

Wenbo Zhang, Keren Fu, Zhuo Wang et al.

Recently CNN-based RGB-D salient object detection (SOD) has obtained significant improvement on detection accuracy. However, existing models often fail to perform well in terms of efficiency and accuracy simultaneously. This hinders their potential applications on mobile devices as well as many real-world problems. To bridge the accuracy gap between lightweight and large models for RGB-D SOD, in this paper, an efficient module that can greatly improve the accuracy but adds little computation is proposed. Inspired by the fact that depth quality is a key factor influencing the accuracy, we propose an efficient depth quality-inspired feature manipulation (DQFM) process, which can dynamically filter depth features according to depth quality. The proposed DQFM resorts to the alignment of low-level RGB and depth features, as well as holistic attention of the depth stream to explicitly control and enhance cross-modal fusion. We embed DQFM to obtain an efficient lightweight RGB-D SOD model called DFM-Net, where we in addition design a tailored depth backbone and a two-stage decoder as basic parts. Extensive experimental results on nine RGB-D datasets demonstrate that our DFM-Net outperforms recent efficient models, running at about 20 FPS on CPU with only 8.5Mb model size, and meanwhile being 2.9/2.4 times faster and 6.7/3.1 times smaller than the latest best models A2dele and MobileSal. It also maintains state-of-the-art accuracy when even compared to non-efficient models. Interestingly, further statistics and analyses verify the ability of DQFM in distinguishing depth maps of various qualities without any quality labels. Last but not least, we further apply DFM-Net to deal with video SOD (VSOD), achieving comparable performance against recent efficient models while being 3/2.3 times faster/smaller than the prior best in this field. Our code is available at https://github.com/zwbx/DFM-Net.

LGOct 22, 2023Code
Learning Interpretable Rules for Scalable Data Representation and Classification

Zhuo Wang, Wei Zhang, Ning Liu et al.

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. A novel design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on ten small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.

91.0AIApr 12Code
Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen, Xiao Yang, Xu Yang et al.

We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training -- whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels -- from static rule-based training to closed-loop online RL with trajectory collection -- each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains -- on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts -- yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks -- within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.

DCApr 9, 2022
Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs

Berkin Akin, Suyog Gupta, Yun Long et al.

On-device ML accelerators are becoming a standard in modern mobile system-on-chips (SoC). Neural architecture search (NAS) comes to the rescue for efficiently utilizing the high compute throughput offered by these accelerators. However, existing NAS frameworks have several practical limitations in scaling to multiple tasks and different target platforms. In this work, we provide a two-pronged approach to this challenge: (i) a NAS-enabling infrastructure that decouples model cost evaluation, search space design, and the NAS algorithm to rapidly target various on-device ML tasks, and (ii) search spaces crafted from group convolution based inverted bottleneck (IBN) variants that provide flexible quality/performance trade-offs on ML accelerators, complementing the existing full and depthwise convolution based IBNs. Using this approach we target a state-of-the-art mobile platform, Google Tensor SoC, and demonstrate neural architectures that improve the quality-performance pareto frontier for various computer vision (classification, detection, segmentation) as well as natural language processing tasks.

CLJun 2, 2023
Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today

Zhuo Wang, Rongzhen Li, Bowen Dong et al.

Recent investigations show that large language models (LLMs), specifically GPT-4, not only have remarkable capabilities in common Natural Language Processing (NLP) tasks but also exhibit human-level performance on various professional and academic benchmarks. However, whether GPT-4 can be directly used in practical applications and replace traditional artificial intelligence (AI) tools in specialized domains requires further experimental validation. In this paper, we explore the potential of LLMs such as GPT-4 to outperform traditional AI tools in dementia diagnosis. Comprehensive comparisons between GPT-4 and traditional AI tools are conducted to examine their diagnostic accuracy in a clinical setting. Experimental results on two real clinical datasets show that, although LLMs like GPT-4 demonstrate potential for future advancements in dementia diagnosis, they currently do not surpass the performance of traditional AI tools. The interpretability and faithfulness of GPT-4 are also evaluated by comparison with real doctors. We discuss the limitations of GPT-4 in its current state and propose future research directions to enhance GPT-4 in dementia diagnosis.

CLNov 10, 2022
Not Just Plain Text! Fuel Document-Level Relation Extraction with Explicit Syntax Refinement and Subsentence Modeling

Zhichao Duan, Xiuxing Li, Zhenyu Li et al.

Document-level relation extraction (DocRE) aims to identify semantic labels among entities within a single document. One major challenge of DocRE is to dig decisive details regarding a specific entity pair from long text. However, in many cases, only a fraction of text carries required information, even in the manually labeled supporting evidence. To better capture and exploit instructive information, we propose a novel expLicit syntAx Refinement and Subsentence mOdeliNg based framework (LARSON). By introducing extra syntactic information, LARSON can model subsentences of arbitrary granularity and efficiently screen instructive ones. Moreover, we incorporate refined syntax into text representations which further improves the performance of LARSON. Experimental results on three benchmark datasets (DocRED, CDR, and GDA) demonstrate that LARSON significantly outperforms existing methods.

93.7AIApr 16
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning

Zhuo Wang, Zhuo Zhang, Yafu Li et al.

Large Language Models (LLMs) exhibit strong mathematical reasoning when trained on high-quality Chain-of-Thought (CoT) that articulates intermediate steps, yet costly CoT curation hinders further progress. While existing remedies such as distillation from stronger LLMs and self-synthesis based on test-time search alleviate this issue, they often suffer from diminishing returns or high computing overhead.In this work, we propose CoTEvol, a genetic evolutionary framework that casts CoT generation as a population-based search over reasoning trajectories.Candidate trajectories are iteratively evolved through reflective global crossover at the trajectory level and local mutation guided by uncertainty at the step level, enabling holistic recombination and fine-grained refinement. Lightweight, task-aware fitness functions are designed to guide the evolutionary process toward accurate and diverse reasoning. Empirically, CoTEvol improves correct-CoT synthesis success by over 30% and enhances structural diversity, with markedly improved efficiency. LLMs trained on these evolutionary CoT data achieve an average gain of 6.6% across eight math benchmarks, outperforming previous distillation and self-synthesis approaches. These results underscore the promise of evolutionary CoT synthesis as a scalable and effective method for mathematical reasoning tasks.

CVFeb 3Code
High-Resolution Underwater Camouflaged Object Detection: GBU-UCOD Dataset and Topology-Aware and Frequency-Decoupled Networks

Wenji Wu, Shuo Ye, Yiyu Liu et al.

Underwater Camouflaged Object Detection (UCOD) is a challenging task due to the extreme visual similarity between targets and backgrounds across varying marine depths. Existing methods often struggle with topological fragmentation of slender creatures in the deep sea and the subtle feature extraction of transparent organisms. In this paper, we propose DeepTopo-Net, a novel framework that integrates topology-aware modeling with frequency-decoupled perception. To address physical degradation, we design the Water-Conditioned Adaptive Perceptor (WCAP), which employs Riemannian metric tensors to dynamically deform convolutional sampling fields. Furthermore, the Abyssal-Topology Refinement Module (ATRM) is developed to maintain the structural connectivity of spindly targets through skeletal priors. Specifically, we first introduce GBU-UCOD, the first high-resolution (2K) benchmark tailored for marine vertical zonation, filling the data gap for hadal and abyssal zones. Extensive experiments on MAS3K, RMAS, and our proposed GBU-UCOD datasets demonstrate that DeepTopo-Net achieves state-of-the-art performance, particularly in preserving the morphological integrity of complex underwater patterns. The datasets and codes will be released at https://github.com/Wuwenji18/GBU-UCOD.

SDJan 26, 2025Code
AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement

Junan Zhang, Jing Yang, Zihao Fang et al.

We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.

CVJan 27Code
Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

Sen Nie, Jie Zhang, Zhuo Wang et al.

Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.

CVOct 5, 2022
Cloud removal Using Atmosphere Model

Yi Guo, Feng Li, Zhuo Wang

Cloud removal is an essential task in remote sensing data analysis. As the image sensors are distant from the earth ground, it is likely that part of the area of interests is covered by cloud. Moreover, the atmosphere in between creates a constant haze layer upon the acquired images. To recover the ground image, we propose to use scattering model for temporal sequence of images of any scene in the framework of low rank and sparse models. We further develop its variant, which is much faster and yet more accurate. To measure the performance of different methods {\em objectively}, we develop a semi-realistic simulation method to produce cloud cover so that various methods can be quantitatively analysed, which enables detailed study of many aspects of cloud removal algorithms, including verifying the effectiveness of proposed models in comparison with the state-of-the-arts, including deep learning models, and addressing the long standing problem of the determination of regularisation parameters. The latter is companioned with theoretic analysis on the range of the sparsity regularisation parameter and verified numerically.

LGJul 2, 2025Code
Test-Time Scaling with Reflective Generative Model

Zixiao Wang, Yuxin Wang, Xiaorui Wang et al.

We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.

AIMar 2
FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Qizheng Li, Yifei Zhang, Xiao Yang et al.

Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs--an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning--highlighting both the promise and current boundaries of autonomous LLM fine-tuning.

94.1AIMay 11
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou, Zhihan Guo, Langzhang Liang et al.

Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.

LGSep 30, 2021Code
Scalable Rule-Based Representation Learning for Interpretable Classification

Zhuo Wang, Wei Zhang, Ning Liu et al.

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. An improved design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on nine small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.

CVJul 5, 2021Code
Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection

Wenbo Zhang, Ge-Peng Ji, Zhuo Wang et al.

RGB-D salient object detection (SOD) recently has attracted increasing research interest by benefiting conventional RGB SOD with extra depth information. However, existing RGB-D SOD models often fail to perform well in terms of both efficiency and accuracy, which hinders their potential applications on mobile devices and real-world problems. An underlying challenge is that the model accuracy usually degrades when the model is simplified to have few parameters. To tackle this dilemma and also inspired by the fact that depth quality is a key factor influencing the accuracy, we propose a novel depth quality-inspired feature manipulation (DQFM) process, which is efficient itself and can serve as a gating mechanism for filtering depth features to greatly boost the accuracy. DQFM resorts to the alignment of low-level RGB and depth features, as well as holistic attention of the depth stream to explicitly control and enhance cross-modal fusion. We embed DQFM to obtain an efficient light-weight model called DFM-Net, where we also design a tailored depth backbone and a two-stage decoder for further efficiency consideration. Extensive experimental results demonstrate that our DFM-Net achieves state-of-the-art accuracy when comparing to existing non-efficient models, and meanwhile runs at 140ms on CPU (2.2$\times$ faster than the prior fastest efficient model) with only $\sim$8.5Mb model size (14.9% of the prior lightest). Our code will be available at https://github.com/zwbx/DFM-Net.

LGDec 10, 2019Code
Transparent Classification with Multilayer Logical Perceptrons and Random Binarization

Zhuo Wang, Wei Zhang, Ning Liu et al.

Models with transparent inner structure and high classification performance are required to reduce potential risk and provide trust for users in domains like health care, finance, security, etc. However, existing models are hard to simultaneously satisfy the above two properties. In this paper, we propose a new hierarchical rule-based model for classification tasks, named Concept Rule Sets (CRS), which has both a strong expressive ability and a transparent inner structure. To address the challenge of efficiently learning the non-differentiable CRS model, we propose a novel neural network architecture, Multilayer Logical Perceptron (MLLP), which is a continuous version of CRS. Using MLLP and the Random Binarization (RB) method we proposed, we can search the discrete solution of CRS in continuous space using gradient descent and ensure the discrete CRS acts almost the same as the corresponding continuous MLLP. Experiments on 12 public data sets show that CRS outperforms the state-of-the-art approaches and the complexity of the learned CRS is close to the simple decision tree. Source code is available at https://github.com/12wang3/mllp.

CVAug 6, 2024
Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection

Sen Nie, Zhuo Wang, Xinxin Wang et al.

Recent studies emphasize the crucial role of data augmentation in enhancing the performance of object detection models. However,existing methodologies often struggle to effectively harmonize dataset diversity with semantic coordination.To bridge this gap, we introduce an innovative augmentation technique leveraging pre-trained conditional diffusion models to mediate this balance. Our approach encompasses the development of a Category Affinity Matrix, meticulously designed to enhance dataset diversity, and a Surrounding Region Alignment strategy, which ensures the preservation of semantic coordination in the augmented images. Extensive experimental evaluations confirm the efficacy of our method in enriching dataset diversity while seamlessly maintaining semantic coordination. Our method yields substantial average improvements of +1.4AP, +0.9AP, and +3.4AP over existing alternatives on three distinct object detection models, respectively.

AIDec 30, 2025
LoongFlow: Directed Evolutionary Search via a Cognitive Plan-Execute-Summarize Paradigm

Chunhui Wan, Xunan Dai, Zhuo Wang et al.

The transition from static Large Language Models (LLMs) to self-improving agents is hindered by the lack of structured reasoning in traditional evolutionary approaches. Existing methods often struggle with premature convergence and inefficient exploration in high-dimensional code spaces. To address these challenges, we introduce LoongFlow, a self-evolving agent framework that achieves state-of-the-art solution quality with significantly reduced computational costs. Unlike "blind" mutation operators, LoongFlow integrates LLMs into a cognitive "Plan-Execute-Summarize" (PES) paradigm, effectively mapping the evolutionary search to a reasoning-heavy process. To sustain long-term architectural coherence, we incorporate a hybrid evolutionary memory system. By synergizing Multi-Island models with MAP-Elites and adaptive Boltzmann selection, this system theoretically balances the exploration-exploitation trade-off, maintaining diverse behavioral niches to prevent optimization stagnation. We instantiate LoongFlow with a General Agent for algorithmic discovery and an ML Agent for pipeline optimization. Extensive evaluations on the AlphaEvolve benchmark and Kaggle competitions demonstrate that LoongFlow outperforms leading baselines (e.g., OpenEvolve, ShinkaEvolve) by up to 60% in evolutionary efficiency while discovering superior solutions. LoongFlow marks a substantial step forward in autonomous scientific discovery, enabling the generation of expert-level solutions with reduced computational overhead.

CVDec 11, 2025
RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Zhuo Wang, Xiliang Liu, Ligang Sun

The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

SEDec 5, 2025
Toward Patch Robustness Certification and Detection for Deep Learning Systems Beyond Consistent Samples

Qilin Zhou, Zhengyuan Wei, Haipeng Wang et al.

Patch robustness certification is an emerging kind of provable defense technique against adversarial patch attacks for deep learning systems. Certified detection ensures the detection of all patched harmful versions of certified samples, which mitigates the failures of empirical defense techniques that could (easily) be compromised. However, existing certified detection methods are ineffective in certifying samples that are misclassified or whose mutants are inconsistently pre icted to different labels. This paper proposes HiCert, a novel masking-based certified detection technique. By focusing on the problem of mutants predicted with a label different from the true label with our formal analysis, HiCert formulates a novel formal relation between harmful samples generated by identified loopholes and their benign counterparts. By checking the bound of the maximum confidence among these potentially harmful (i.e., inconsistent) mutants of each benign sample, HiCert ensures that each harmful sample either has the minimum confidence among mutants that are predicted the same as the harmful sample itself below this bound, or has at least one mutant predicted with a label different from the harmful sample itself, formulated after two novel insights. As such, HiCert systematically certifies those inconsistent samples and consistent samples to a large extent. To our knowledge, HiCert is the first work capable of providing such a comprehensive patch robustness certification for certified detection. Our experiments show the high effectiveness of HiCert with a new state-of the-art performance: It certifies significantly more benign samples, including those inconsistent and consistent, and achieves significantly higher accuracy on those samples without warnings and a significantly lower false silent ratio.

AISep 8, 2025
From Implicit Exploration to Structured Reasoning: Leveraging Guideline and Refinement for LLMs

Jiaxiang Chen, Zhuo Wang, Mingxi Zou et al. · microsoft-research

Large language models (LLMs) have advanced general-purpose reasoning, showing strong performance across diverse tasks. However, existing methods often rely on implicit exploration, where the model follows stochastic and unguided reasoning paths-like walking without a map. This leads to unstable reasoning paths, lack of error correction, and limited learning from past experience. To address these issues, we propose a framework that shifts from implicit exploration to structured reasoning through guideline and refinement. First, we extract structured reasoning patterns from successful trajectories and reflective signals from failures. During inference, the model follows these guidelines step-by-step, with refinement applied after each step to correct errors and stabilize the reasoning process. Experiments on BBH and four additional benchmarks (GSM8K, MATH-500, MBPP, HumanEval) show that our method consistently outperforms strong baselines across diverse reasoning tasks. Structured reasoning with stepwise execution and refinement improves stability and generalization, while guidelines transfer well across domains and flexibly support cross-model collaboration, matching or surpassing supervised fine-tuning in effectiveness and scalability.

SDAug 8, 2025
DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Wei Chen, Binzhu Sha, Dan Luo et al.

Singing Voice Conversion (SVC) transfers a source singer's timbre to a target while keeping melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either face timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity and naturalness, outperforming state-of-the-art methods in both subjective and objective evaluations.

LGJun 10, 2025
FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making

Jiaxiang Chen, Mingxi Zou, Zhuo Wang et al.

Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions-such as expert reliance under information asymmetry, loss-averse sensitivity, and feedback-driven temporal adjustment. We propose FinHEAR, a multi-agent framework for Human Expertise and Adaptive Risk-aware reasoning. FinHEAR orchestrates specialized LLM-based agents to analyze historical trends, interpret current events, and retrieve expert-informed precedents within an event-centric pipeline. Grounded in behavioral economics, it incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns.

AIJun 9, 2025
Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation

Jiaxiang Chen, Zhuo Wang, Mingxi Zou et al.

Human reasoning is flexible, adaptive, and grounded in prior experience-qualities that large language models (LLMs) still struggle to emulate. Existing methods either explore diverse reasoning paths at inference time or search for optimal workflows through expensive operations, but both fall short in leveraging multiple reusable strategies in a structured, efficient manner. We propose Guideline Forest, a framework that enhances LLMs reasoning by inducing structured reasoning strategies-called guidelines-from verified examples and executing them via step-wise aggregation. Unlike test-time search or single-path distillation, our method draws on verified reasoning experiences by inducing reusable guidelines and expanding each into diverse variants. Much like human reasoning, these variants reflect alternative thought patterns, are executed in parallel, refined via self-correction, and aggregated step by step-enabling the model to adaptively resolve uncertainty and synthesize robust solutions.We evaluate Guideline Forest on four benchmarks-GSM8K, MATH-500, MBPP, and HumanEval-spanning mathematical and programmatic reasoning. Guideline Forest consistently outperforms strong baselines, including CoT, ReAct, ToT, FoT, and AFlow. Ablation studies further highlight the effectiveness of multi-path reasoning and stepwise aggregation, underscoring the Guideline Forest's adaptability and generalization potential.

CVFeb 21, 2025
AutoMR: A Universal Time Series Motion Recognition Pipeline

Likun Zhang, Sicheng Yang, Zhuo Wang et al.

In this paper, we present an end-to-end automated motion recognition (AutoMR) pipeline designed for multimodal datasets. The proposed framework seamlessly integrates data preprocessing, model training, hyperparameter tuning, and evaluation, enabling robust performance across diverse scenarios. Our approach addresses two primary challenges: 1) variability in sensor data formats and parameters across datasets, which traditionally requires task-specific machine learning implementations, and 2) the complexity and time consumption of hyperparameter tuning for optimal model performance. Our library features an all-in-one solution incorporating QuartzNet as the core model, automated hyperparameter tuning, and comprehensive metrics tracking. Extensive experiments demonstrate its effectiveness on 10 diverse datasets, achieving state-of-the-art performance. This work lays a solid foundation for deploying motion-capture solutions across varied real-world applications.

CVFeb 17, 2025
Duo Streamers: A Streaming Gesture Recognition Framework

Boxuan Zhu, Sicheng Yang, Zhuo Wang et al.

Gesture recognition in resource-constrained scenarios faces significant challenges in achieving high accuracy and low latency. The streaming gesture recognition framework, Duo Streamers, proposed in this paper, addresses these challenges through a three-stage sparse recognition mechanism, an RNN-lite model with an external hidden state, and specialized training and post-processing pipelines, thereby making innovative progress in real-time performance and lightweight design. Experimental results show that Duo Streamers matches mainstream methods in accuracy metrics, while reducing the real-time factor by approximately 92.3%, i.e., delivering a nearly 13-fold speedup. In addition, the framework shrinks parameter counts to 1/38 (idle state) and 1/9 (busy state) compared to mainstream models. In summary, Duo Streamers not only offers an efficient and practical solution for streaming gesture recognition in resource-constrained devices but also lays a solid foundation for extended applications in multimodal and diverse scenarios.

SDFeb 7, 2025
Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features

Wei Chen, Binzhu Sha, Jing Yang et al.

Melody preservation is crucial in singing voice conversion (SVC). However, in many scenarios, audio is often accompanied with background music (BGM), which can cause audio distortion and interfere with the extraction of melody and other key features, significantly degrading SVC performance. Previous methods have attempted to address this by using more robust neural network-based melody extractors, but their performance drops sharply in the presence of complex accompaniment. Other approaches involve performing source separation before conversion, but this often introduces noticeable artifacts, leading to a significant drop in conversion quality and increasing the user's operational costs. To address these issues, we introduce a novel SVC method that uses self-supervised representation-based melody features to improve melody modeling accuracy in the presence of BGM. In our experiments, we compare the effectiveness of different self-supervised learning (SSL) models for melody extraction and explore for the first time how SSL benefits the task of melody extraction. The experimental results demonstrate that our proposed SVC model significantly outperforms existing baseline methods in terms of melody accuracy and shows higher similarity and naturalness in both subjective and objective evaluations across noisy and clean audio environments.

APJan 26, 2024
Validating Climate Models with Spherical Convolutional Wasserstein Distance

Robert C. Garrett, Trevor Harris, Bo Li et al.

The validation of global climate models is crucial to ensure the accuracy and efficacy of model output. We introduce the spherical convolutional Wasserstein distance to more comprehensively measure differences between climate models and reanalysis data. This new similarity measure accounts for spatial variability using convolutional projections and quantifies local differences in the distribution of climate variables. We apply this method to evaluate the historical model outputs of the Coupled Model Intercomparison Project (CMIP) members by comparing them to observational and reanalysis data products. Additionally, we investigate the progression from CMIP phase 5 to phase 6 and find modest improvements in the phase 6 models regarding their ability to produce realistic climatologies.

ROJan 12, 2020
Learning to drive via Apprenticeship Learning and Deep Reinforcement Learning

Wenhui Huang, Francesco Braghin, Zhuo Wang

With the implementation of reinforcement learning (RL) algorithms, current state-of-art autonomous vehicle technology have the potential to get closer to full automation. However, most of the applications have been limited to game domains or discrete action space which are far from the real world driving. Moreover, it is very tough to tune the parameters of reward mechanism since the driving styles vary a lot among the different users. For instance, an aggressive driver may prefer driving with high acceleration whereas some conservative drivers prefer a safer driving style. Therefore, we propose an apprenticeship learning in combination with deep reinforcement learning approach that allows the agent to learn the driving and stopping behaviors with continuous actions. We use gradient inverse reinforcement learning (GIRL) algorithm to recover the unknown reward function and employ REINFORCE as well as Deep Deterministic Policy Gradient algorithm (DDPG) to learn the optimal policy. The performance of our method is evaluated in simulation-based scenario and the results demonstrate that the agent performs human like driving and even better in some aspects after training.

IRNov 5, 2019
ColluEagle: Collusive review spammer detection using Markov random fields

Zhuo Wang, Runlong Hu, Qian Chen et al.

Product reviews are extremely valuable for online shoppers in providing purchase decisions. Driven by immense profit incentives, fraudsters deliberately fabricate untruthful reviews to distort the reputation of online products. As online reviews become more and more important, group spamming, i.e., a team of fraudsters working collaboratively to attack a set of target products, becomes a new fashion. Previous works use review network effects, i.e. the relationships among reviewers, reviews, and products, to detect fake reviews or review spammers, but ignore time effects, which are critical in characterizing group spamming. In this paper, we propose a novel Markov random field (MRF)-based method (ColluEagle) to detect collusive review spammers, as well as review spam campaigns, considering both network effects and time effects. First we identify co-review pairs, a review phenomenon that happens between two reviewers who review a common product in a similar way, and then model reviewers and their co-review pairs as a pairwise-MRF, and use loopy belief propagation to evaluate the suspiciousness of reviewers. We further design a high quality yet easy-to-compute node prior for ColluEagle, through which the review spammer groups can also be subsequently identified. Experiments show that ColluEagle can not only detect collusive spammers with high precision, significantly outperforming state-of-the-art baselines --- FraudEagle and SpEagle, but also identify highly suspicious review spammer campaigns.

CVJul 17, 2018
Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN)

Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang et al.

Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. In order to reduce this cost, several quantization schemes have gained attention recently with some focusing on weight quantization, and others focusing on quantizing activations. This paper proposes novel techniques that target weight and activation quantizations separately resulting in an overall quantized neural network (QNN). The activation quantization technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $α$ that is optimized during training to find the right quantization scale. The weight quantization scheme, statistics-aware weight binning (SAWB), finds the optimal scaling factor that minimizes the quantization error based on the statistical characteristics of the distribution of weights without the need for an exhaustive search. The combination of PACT and SAWB results in a 2-bit QNN that achieves state-of-the-art classification accuracy (comparable to full precision networks) across a range of popular models and datasets.

CVMay 16, 2018
PACT: Parameterized Clipping Activation for Quantized Neural Networks

Jungwook Choi, Zhuo Wang, Swagath Venkataramani et al.

Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $α$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.