Haotian Wu

CV
h-index66
25papers
180citations
Novelty57%
AI Score58

25 Papers

ASNov 11, 2022
Enhancing and Adversarial: Improve ASR with Speaker Labels

Wei Zhou, Haotian Wu, Jingjing Xu et al.

ASR can be improved by multi-task learning (MTL) with domain enhancing or domain adversarial training, which are two opposite objectives with the aim to increase/decrease domain variance towards domain-aware/agnostic ASR, respectively. In this work, we study how to best apply these two opposite objectives with speaker labels to improve conformer-based ASR. We also propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort. Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training. We also explore their combination for further improvement, achieving the same performance as i-vectors plus adversarial training. Our best speaker-based MTL achieves 7\% relative improvement on the Switchboard Hub5'00 set. We also investigate the effect of such speaker-based MTL w.r.t. cleaner dataset and weaker ASR NN.

CLMay 31
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

Yubo Gao, Haotian Wu, Hong Chen et al.

Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularity: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.

CLNov 13, 2025Code
EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Junquan Huang, Haotian Wu, Yubo Gao et al.

Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

CVSep 29, 2024
DiffCP: Ultra-Low Bit Collaborative Perception via Diffusion Model

Ruiqing Mao, Haotian Wu, Yukuan Jia et al.

Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of stand-alone intelligence. However, current wireless communication systems are unable to support feature-level and raw-level collaborative algorithms due to their enormous bandwidth demands. In this paper, we propose DiffCP, a novel CP paradigm that utilizes a specialized diffusion model to efficiently compress the sensing information of collaborators. By incorporating both geometric and semantic conditions into the generative model, DiffCP enables feature-level collaboration with an ultra-low communication cost, advancing the practical implementation of CP systems. This paradigm can be seamlessly integrated into existing CP algorithms to enhance a wide range of downstream tasks. Through extensive experimentation, we investigate the trade-offs between communication, computation, and performance. Numerical results demonstrate that DiffCP can significantly reduce communication costs by 14.5-fold while maintaining the same performance as the state-of-the-art algorithm.

CVOct 12, 2022
GGViT:Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection

Haotian Wu, Peipei Wang, Xin Wang et al.

Detecting manipulated facial images and videos on social networks has been an urgent problem to be solved. The compression of videos on social media has destroyed some pixel details that could be used to detect forgeries. Hence, it is crucial to detect manipulated faces in videos of different quality. We propose a new multi-stream network architecture named GGViT, which utilizes global information to improve the generalization of the model. The embedding of the whole face extracted by ViT will guide each stream network. Through a large number of experiments, we have proved that our proposed model achieves state-of-the-art classification accuracy on FF++ dataset, and has been greatly improved on scenarios of different compression rates. The accuracy of Raw/C23, Raw/C40 and C23/C40 was increased by 24.34%, 15.08% and 10.14% respectively.

IROct 1, 2023Code
TDCGL: Two-Level Debiased Contrastive Graph Learning for Recommendation

Yubo Gao, Haotian Wu

knowledge graph-based recommendation methods have achieved great success in the field of recommender systems. However, over-reliance on high-quality knowledge graphs is a bottleneck for such methods. Specifically, the long-tailed distribution of entities of KG and noise issues in the real world will make item-entity dependent relations deviate from reflecting true characteristics and significantly harm the performance of modeling user preference. Contrastive learning, as a novel method that is employed for data augmentation and denoising, provides inspiration to fill this research gap. However, the mainstream work only focuses on the long-tail properties of the number of items clicked, while ignoring that the long-tail properties of total number of clicks per user may also affect the performance of the recommendation model. Therefore, to tackle these problems, motivated by the Debiased Contrastive Learning of Unsupervised Sentence Representations (DCLR), we propose Two-Level Debiased Contrastive Graph Learning (TDCGL) model. Specifically, we design the Two-Level Debiased Contrastive Learning (TDCL) and deploy it in the KG, which is conducted not only on User-Item pairs but also on User-User pairs for modeling higher-order relations. Also, to reduce the bias caused by random sampling in contrastive learning, with the exception of the negative samples obtained by random sampling, we add a noise-based generation of negation to ensure spatial uniformity. Considerable experiments on open-source datasets demonstrate that our method has excellent anti-noise capability and significantly outperforms state-of-the-art baselines. In addition, ablation studies about the necessity for each level of TDCL are conducted.

CVApr 16
M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

Haotian Wu, Yue Cheng, Shan Bian

With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.

ITMar 12
CNNs in the Air via Reconfigurable Intelligent Surfaces

Meng Hua, Haotian Wu, Deniz Gündüz

This paper introduces AirCNN, a novel paradigm for implementing convolutional neural networks (CNNs) via over-the-air (OTA) analog computation. By leveraging multiple reconfigurable intelligent surfaces (RISs) and transceiver designs, we engineer the ambient wireless propagation environment to emulate the operations of a CNN layer. To comprehensively evaluate AirCNN, we consider two types of CNNs, namely classic two-dimensional (2D) convolution (Conv2d) and light-weight convolution, i.e., depthwise separable convolution (ConvSD). For Conv2d realization via OTA computation, we propose and analyze two RIS-aided transmission architectures: multiple-input multiple-output (MIMO) and multiple-input single-output (MISO), balancing transmission overhead and emulation performance. We jointly optimize all parameters, including the transmitter precoder, receiver combiner, and RIS phase shifts, under practical constraints such as transmit power budget and unit-modulus phase shift requirements. We further extend the framework to ConvSD, which requires distinct transmission strategies for depthwise and pointwise convolutions. Simulation results demonstrate that the proposed AirCNN architectures can achieve satisfactory classification performance. Notably, Conv2d MISO consistently outperforms Conv2d MIMO across various settings, while for ConvSD, MISO is superior only under poor channel conditions. Moreover, employing multiple RISs significantly enhances performance compared to a single RIS, especially in line-of-sight (LoS)-dominated wireless environments.

LGJun 3, 2022
A High-Performance Customer Churn Prediction System based on Self-Attention

Haotian Wu

Customer churn prediction is a challenging domain of research that contributes to customer retention strategy. The predictive performance of existing machine learning models, which are often adopted by churn communities, appear to be at a bottleneck, partly due to models' poor feature extraction capability. Therefore, a novel algorithm, a hybrid neural network with self-attention enhancement (HNNSAE), is proposed in this paper to improve the efficiency of feature screening and feature extraction, consequently improving the model's predictive performance. This model consists of three main blocks. The first block is the entity embedding layer, which is employed to process the categorical variables transformed into 0-1 code. The second block is the feature extractor, which extracts the significant features through the multi-head self-attention mechanism. In addition, to improve the feature extraction effect, we stack the residual connection neural network on multi-head self-attention modules. The third block is a classifier, which is a three-layer multilayer perceptron. This work conducts experiments on publicly available dataset related to commercial bank customers. The result demonstrates that HNNSAE significantly outperforms the other Individual Machine Learning (IML), Ensemble Machine Learning (EML), and Deep Learning (DL) methods tested in this paper. Furthermore, we compare the performance of the feature extractor proposed in this paper with that of other three feature extractors and find that the method proposed in this paper significantly outperforms other methods. In addition, four hypotheses about model prediction performance and overfitting risk are tested on the publicly available dataset.

CVMar 14, 2022
Cross-View-Prediction: Exploring Contrastive Feature for Hyperspectral Image Classification

Anyu Zhang, Haotian Wu, Zeyu Cao

This paper presents a self-supervised feature learning method for hyperspectral image classification. Our method tries to construct two different views of the raw hyperspectral image through a cross-representation learning method. And then to learn semantically consistent representation over the created views by contrastive learning method. Specifically, four cross-channel-prediction based augmentation methods are naturally designed to utilize the high dimension characteristic of hyperspectral data for the view construction. And the better representative features are learned by maximizing mutual information and minimizing conditional entropy across different views from our contrastive network. This 'Cross-View-Predicton' style is straightforward and gets the state-of-the-art performance of unsupervised classification with a simple SVM classifier.

ROMay 13
TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Jianyi Zhou, Ziteng Gao, Feiyang Hong et al.

Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.

LGMay 11
DataMaster: Towards Autonomous Data Engineering for Machine Learning

Yaxin Du, Xiyuan Yang, Zhifan Zhou et al.

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

CLMar 3
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Bo Xu, Haotian Wu, Hehai Lin et al.

Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce \acem, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that \acem sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, \acem achieves an average absolute improvement of 4\% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, \acem delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.

CVFeb 11, 2025
Multiview Point Cloud Registration Based on Minimum Potential Energy for Free-Form Blade Measurement

Zijie Wu, Yaonan Wang, Yang Mo et al.

Point cloud registration is an essential step for free-form blade reconstruction in industrial measurement. Nonetheless, measuring defects of the 3D acquisition system unavoidably result in noisy and incomplete point cloud data, which renders efficient and accurate registration challenging. In this paper, we propose a novel global registration method that is based on the minimum potential energy (MPE) method to address these problems. The basic strategy is that the objective function is defined as the minimum potential energy optimization function of the physical registration system. The function distributes more weight to the majority of inlier points and less weight to the noise and outliers, which essentially reduces the influence of perturbations in the mathematical formulation. We decompose the solution into a globally optimal approximation procedure and a fine registration process with the trimmed iterative closest point algorithm to boost convergence. The approximation procedure consists of two main steps. First, according to the construction of the force traction operator, we can simply compute the position of the potential energy minimum. Second, to find the MPE point, we propose a new theory that employs two flags to observe the status of the registration procedure. We demonstrate the performance of the proposed algorithm on four types of blades. The proposed method outperforms the other global methods in terms of both accuracy and noise resistance.

CVMay 8, 2024
Pedestrian Attribute Recognition as Label-balanced Multi-label Learning

Yibo Zhou, Hai-Miao Hu, Yirong Xiang et al.

Rooting in the scarcity of most attributes, realistic pedestrian attribute datasets exhibit unduly skewed data distribution, from which two types of model failures are delivered: (1) label imbalance: model predictions lean greatly towards the side of majority labels; (2) semantics imbalance: model is easily overfitted on the under-represented attributes due to their insufficient semantic diversity. To render perfect label balancing, we propose a novel framework that successfully decouples label-balanced data re-sampling from the curse of attributes co-occurrence, i.e., we equalize the sampling prior of an attribute while not biasing that of the co-occurred others. To diversify the attributes semantics and mitigate the feature noise, we propose a Bayesian feature augmentation method to introduce true in-distribution novelty. Handling both imbalances jointly, our work achieves best accuracy on various popular benchmarks, and importantly, with minimal computational budget.

CVSep 7, 2025
Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

Ruiqi Shen, Haotian Wu, Wenjing Zhang et al.

Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3* 10(-3) bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.

CLAug 5, 2025
Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Haotian Wu, Bo Xu, Yao Shu et al.

Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in-distribution performance to training-based SOTA reasoning method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

CLFeb 13, 2025
Multi-level Conflict-Aware Network for Multi-modal Sentiment Analysis

Yubo Gao, Haotian Wu, Lei Zhang

Multimodal Sentiment Analysis (MSA) aims to recognize human emotions by exploiting textual, acoustic, and visual modalities, and thus how to make full use of the interactions between different modalities is a central challenge of MSA. Interaction contains alignment and conflict aspects. Current works mainly emphasize alignment and the inherent differences between unimodal modalities, neglecting the fact that there are also potential conflicts between bimodal combinations. Additionally, multi-task learning-based conflict modeling methods often rely on the unstable generated labels. To address these challenges, we propose a novel multi-level conflict-aware network (MCAN) for multimodal sentiment analysis, which progressively segregates alignment and conflict constituents from unimodal and bimodal representations, and further exploits the conflict constituents with the conflict modeling branch. In the conflict modeling branch, we conduct discrepancy constraints at both the representation and predicted output levels, avoiding dependence on the generated labels. Experimental results on the CMU-MOSI and CMU-MOSEI datasets demonstrate the effectiveness of the proposed MCAN.

AIJan 21
Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge

Yiyang Feng, Zeming Chen, Haotian Wu et al.

A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model's parametric knowledge, which propagate to faulty reasoning. Current benchmarks for this problem, however, largely focus only on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model's initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts to a model, and that this performance degradation exacerbates as more updated facts are provided. We show this failure stems from both inability to faithfully integrate updated facts, but also flawed reasoning even when knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.

CLOct 8, 2025
FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Haotian Wu, Shufan Jiang, Mingyu Chen et al.

As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

AISep 30, 2025
Interactive Learning for LLM Reasoning

Hehai Lin, Shilei Cao, Sudong Wang et al.

Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs' independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM's reward distribution characteristics into another's reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.

CVJul 31, 2025
Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

Yingkai Wang, Yaoyao Zhu, Xiuding Cai et al.

Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

SPMay 30, 2023
A Graph Reconstruction by Dynamic Signal Coefficient for Fault Classification

Wenbin He, Jianxu Mao, Yaonan Wang et al.

To improve the performance in identifying the faults under strong noise for rotating machinery, this paper presents a dynamic feature reconstruction signal graph method, which plays the key role of the proposed end-to-end fault diagnosis model. Specifically, the original mechanical signal is first decomposed by wavelet packet decomposition (WPD) to obtain multiple subbands including coefficient matrix. Then, with originally defined two feature extraction factors MDD and DDD, a dynamic feature selection method based on L2 energy norm (DFSL) is proposed, which can dynamically select the feature coefficient matrix of WPD based on the difference in the distribution of norm energy, enabling each sub-signal to take adaptive signal reconstruction. Next the coefficient matrices of the optimal feature sub-bands are reconstructed and reorganized to obtain the feature signal graphs. Finally, deep features are extracted from the feature signal graphs by 2D-Convolutional neural network (2D-CNN). Experimental results on a public data platform of a bearing and our laboratory platform of robot grinding show that this method is better than the existing methods under different noise intensities.

TRApr 21, 2021
Cyclic Arbitrage in Decentralized Exchanges

Ye Wang, Yan Chen, Haotian Wu et al.

Decentralized Exchanges (DEXes) enable users to create markets for exchanging any pair of cryptocurrencies. The direct exchange rate of two tokens may not match the cross-exchange rate in the market, and such price discrepancies open up arbitrage possibilities with trading through different cryptocurrencies cyclically. In this paper, we conduct a systematic investigation on cyclic arbitrages in DEXes. We propose a theoretical framework for studying cyclic arbitrage. With our framework, we analyze the profitability conditions and optimal trading strategies of cyclic transactions. We further examine exploitable arbitrage opportunities and the market size of cyclic arbitrages with transaction-level data of Uniswap V2. We find that traders have executed 292,606 cyclic arbitrages over eleven months and exploited more than 138 million USD in revenue. However, the revenue of the most profitable unexploited opportunity is persistently higher than 1 ETH (4,000 USD), which indicates that DEX markets may not be efficient enough. By analyzing how traders implement cyclic arbitrages, we find that traders can utilize smart contracts to issue atomic transactions and the atomic implementations could mitigate users' financial loss in cyclic arbitrage from the price impact.

CVJun 11, 2020
Minimum Potential Energy of Point Cloud for Robust Global Registration

Zijie Wu, Yaonan Wang, Qing Zhu et al.

In this paper, we propose a novel minimum gravitational potential energy (MPE)-based algorithm for global point set registration. The feature descriptors extraction algorithms have emerged as the standard approach to align point sets in the past few decades. However, the alignment can be challenging to take effect when the point set suffers from raw point data problems such as noises (Gaussian and Uniformly). Different from the most existing point set registration methods which usually extract the descriptors to find correspondences between point sets, our proposed MPE alignment method is able to handle large scale raw data offset without depending on traditional descriptors extraction, whether for the local or global registration methods. We decompose the solution into a global optimal convex approximation and the fast descent process to a local minimum. For the approximation step, the proposed minimum potential energy (MPE) approach consists of two main steps. Firstly, according to the construction of the force traction operator, we could simply compute the position of the potential energy minimum; Secondly, with respect to the finding of the MPE point, we propose a new theory that employs the two flags to observe the status of the registration procedure. The method of fast descent process to the minimum that we employed is the iterative closest point algorithm; it can achieve the global minimum. We demonstrate the performance of the proposed algorithm on synthetic data as well as on real data. The proposed method outperforms the other global methods in terms of both efficiency, accuracy and noise resistance.