LG · May 30 · Code
ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression
Wneya Yu, Chao Zhang, Li Wang et al.
Post-Training Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model's weights) in a way LoRA can't easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task performance. In this paper, we propose ProjQ, a novel framework for constraining quantization noise to the low-rank manifold via orthogonal subspace projection. We derive an efficient alternating algorithm that shapes the quantization noise into a low-rank structure, effectively offloading dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal "uncorrectable" subspace. Our theoretical analysis demonstrates that ProjQ preserves strictly greater model plasticity for downstream tasks compared to standard PTQ. Extensive experiments on LLaMA-2, Qwen2.5 and Qwen3 confirm that ProjQ consistently outperforms existing methods in both quantization error compensation and downstream task fine-tuning, achieving up to $2\times$ lower evaluation loss for compensation and matching the performance of standard 4-bit baselines on language modeling tasks with only 3 bits. The code is available at https://github.com/yy9301/ProjQ.
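As a rough illustration of the idea of splitting quantization error into a low-rank part that an adapter could absorb and an orthogonal residual, the sketch below uses plain NumPy. It is not the paper's alternating algorithm: the round-to-nearest quantizer `quantize_rtn`, the rank `r`, the matrix size, and the use of a truncated SVD are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_rtn(w, bits=3):
    """Hypothetical uniform round-to-nearest quantizer with one scale per tensor."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo

def split_error(w, w_q, r):
    """Split the error W - Q into its best rank-r part and the remaining residual."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    low_rank = (u[:, :r] * s[:r]) @ vt[:r]  # part a rank-r adapter could represent
    residual = err - low_rank               # part no rank-r adapter can remove
    return low_rank, residual

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))           # stand-in for a weight matrix
w_q = quantize_rtn(w, bits=3)
low_rank, residual = split_error(w, w_q, r=8)
```

With plain round-to-nearest quantization the error is close to unstructured, so the rank-8 part captures only a small share of it; the abstract's point is that shaping the noise during quantization shifts more of the error into the correctable part.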
CV · Apr 13 · Code
The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results
Xingyu Qiu, Yuqian Fu, Jiawei Geng et al.
Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.
CV · May 11, 2022
NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results
Yawei Li, Kai Zhang, Radu Timofte et al. · eth-zurich, tencent-ai
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with a focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that improved efficiency, measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption, while at least maintaining a PSNR of 29.00 dB on the DIV2K validation set. IMDN was set as the baseline for efficiency measurement. The challenge had 3 tracks: the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The ranking of the teams was determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered, and the individual rankings of the two metrics were summed to determine a final ranking in this track. In sub-track two, all five metrics mentioned in the description of the challenge (runtime, parameter count, FLOPs, activations, and memory consumption) were considered. Similar to sub-track one, the rankings of the five metrics were summed to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.
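The summed-rank scoring described for the sub-tracks can be sketched as follows. The team names and metric values are hypothetical, and the tie handling (ties broken by input order) is an assumption, since the abstract does not say how ties were resolved.

```python
def rank_positions(values):
    """Return 1-based ranks, where the smallest value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def summed_rank(metrics):
    """Sum per-metric ranks for each team; a lower total means a better place."""
    teams = list(metrics)
    totals = {team: 0 for team in teams}
    num_metrics = len(next(iter(metrics.values())))
    for metric_index in range(num_metrics):
        column = [metrics[team][metric_index] for team in teams]
        for team, rank in zip(teams, rank_positions(column)):
            totals[team] += rank
    return sorted(teams, key=lambda team: totals[team]), totals

# Hypothetical entries: (parameters in millions, FLOPs in billions).
example = {"team_a": (0.25, 20.0), "team_b": (0.30, 15.0), "team_c": (0.20, 14.0)}
final_order, totals = summed_rank(example)
```

Here `team_c` is best on both metrics and receives a total of 2, while the other two teams tie at 5 and keep their input order.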
CV · Mar 15, 2023 · Code
BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection
Lei Yang, Kaicheng Yu, Tao Tang et al.
While most recent autonomous driving systems focus on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach: leveraging intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that state-of-the-art vision-centric bird's eye view detection methods perform poorly on roadside cameras. This is because these methods mainly focus on recovering the depth with respect to the camera center, where the depth difference between the car and the ground quickly shrinks as the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight, to address this issue. In essence, instead of predicting the pixel-wise depth, we regress the height to the ground to achieve a distance-agnostic formulation that eases the optimization process of camera-only perception methods. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. The code is available at https://github.com/ADLab-AutoDrive/BEVHeight.
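The shrinking depth gap can be illustrated with a simplified side-view geometry. This is only a sketch: it uses straight-line range from the camera as a stand-in for depth, and the camera height of 6 m and object height of 1.5 m are hypothetical values, not figures from the paper.

```python
import math

def range_gap(distance, camera_height=6.0, object_height=1.5):
    """Gap between camera-to-ground and camera-to-roof ranges at one horizontal distance."""
    ground = math.hypot(distance, camera_height)
    roof = math.hypot(distance, camera_height - object_height)
    return ground - roof

# Range gap in meters at three horizontal distances from the camera pole.
gaps = {d: range_gap(d) for d in (10.0, 50.0, 100.0)}
```

Under these assumptions the gap falls from roughly 0.7 m at 10 m to under 0.1 m at 100 m, whereas the height above the ground stays at 1.5 m at every distance, which is the distance-agnostic quantity the abstract proposes to regress.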
CV · Jul 10, 2022 · Code
Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection
Lei Yang, Xinyu Zhang, Li Wang et al.
Monocular 3D object detection is an essential perception task for autonomous driving. However, the high reliance on large-scale labeled data makes model optimization costly and time-consuming. To reduce such over-reliance on human annotations, we propose Mix-Teaching, an effective semi-supervised learning framework that employs both labeled and unlabeled images in the training stage. Mix-Teaching first generates pseudo-labels for unlabeled images by self-training. The student model is then trained on mixed images with much denser and more precise labeling, obtained by merging instance-level image patches into empty backgrounds or labeled images. This is the first approach to break the image-level limitation and put high-quality pseudo-labels from multiple frames into one image for semi-supervised training. Besides, as a result of the misalignment between confidence score and localization quality, it is hard to discriminate high-quality pseudo-labels from noisy predictions using only a confidence-based criterion. To that end, we further introduce an uncertainty-based filter to help select reliable pseudo boxes for the above mixing operation. To the best of our knowledge, this is the first unified SSL framework for monocular 3D object detection. Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labeling ratios on the KITTI dataset. For example, our method achieves around +6.34% AP@0.7 improvement against the GUPNet baseline on the validation set when using only 10% labeled data. Besides, by leveraging the full training set and the additional 48K raw images of KITTI, it further improves MonoFlex by +4.65% AP@0.7 for car detection, reaching 18.54% AP@0.7, which ranks first among all monocular-based methods on the KITTI test leaderboard. The code and pretrained models will be released at https://github.com/yanglei18/Mix-Teaching.
CR · Aug 18, 2022 · Code
Private, Efficient, and Accurate: Protecting Models Trained by Multi-party Learning with Differential Privacy
Wenqiang Ruan, Mingxin Xu, Wenjing Fang et al.
Secure multi-party computation-based machine learning, referred to as MPL, has become an important technology to utilize data from multiple parties with privacy preservation. While MPL provides rigorous security guarantees for the computation process, the models trained by MPL are still vulnerable to attacks that solely depend on access to the models. Differential privacy could help to defend against such attacks. However, the accuracy loss brought by differential privacy and the huge communication overhead of secure multi-party computation protocols make it highly challenging to balance the 3-way trade-off between privacy, efficiency, and accuracy. In this paper, we are motivated to resolve the above issue by proposing a solution, referred to as PEA (Private, Efficient, Accurate), which consists of a secure DPSGD protocol and two optimization methods. First, we propose a secure DPSGD protocol to enforce DPSGD in secret sharing-based MPL frameworks. Second, to reduce the accuracy loss caused by differential privacy noise and the huge communication overhead of MPL, we propose two optimization methods for the training process of MPL: (1) the data-independent feature extraction method, which aims to simplify the trained model structure; (2) the local data-based global model initialization method, which aims to speed up the convergence of the model training. We implement PEA in two open-source MPL frameworks: TF-Encrypted and Queqiao. The experimental results on various datasets demonstrate the efficiency and effectiveness of PEA. For example, when $ε$ = 2, we can train a differentially private classification model with an accuracy of 88% for CIFAR-10 within 7 minutes under the LAN setting. This result significantly outperforms that of CryptGPU, a SOTA MPL framework, which takes more than 16 hours to train a non-private deep neural network model on CIFAR-10 with the same accuracy.
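For readers unfamiliar with DPSGD, the update that the secure protocol enforces can be sketched in plaintext as below. This is the standard clip-and-noise step only, without any secret sharing; the function name `dpsgd_step` and the example values are hypothetical and do not come from PEA's implementation.

```python
import numpy as np

def dpsgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One plaintext DPSGD update: clip each example's gradient, add Gaussian noise, average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm.
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean

rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # one row per training example
updated = dpsgd_step(params, grads, lr=1.0, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

The privacy budget $ε$ then follows from the noise multiplier, the sampling rate, and the number of steps through a privacy accountant, which this sketch omits.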
CV · Oct 11, 2023 · Code
Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autonomous Driving
Xinyu Zhang, Li Wang, Jian Chen et al.
Radar has stronger adaptability in adverse scenarios for autonomous driving environmental perception compared to widely adopted cameras and LiDARs. Compared with commonly used 3D radars, the latest 4D radars have precise vertical resolution and higher point cloud density, making them highly promising sensors for autonomous driving in complex environmental perception. However, due to the much higher noise than LiDAR, manufacturers choose different filtering strategies, resulting in an inverse ratio between noise level and point cloud density. There is still a lack of comparative analysis on which method is beneficial for deep learning-based perception algorithms in autonomous driving. One of the main reasons is that current datasets only adopt one type of 4D radar, making it difficult to compare different 4D radars in the same scene. Therefore, in this paper, we introduce a novel large-scale multi-modal dataset featuring, for the first time, two types of 4D radars captured simultaneously. This dataset enables further research into effective 4D radar perception algorithms. Our dataset consists of 151 consecutive series, most of which last 20 seconds and contain 10,007 meticulously synchronized and annotated frames. Moreover, our dataset captures a variety of challenging driving scenarios, including many road conditions, weather conditions, nighttime and daytime with different lighting intensities and periods. Our dataset annotates consecutive frames, which can be applied to 3D object detection and tracking, and also supports the study of multi-modal tasks. We experimentally validate our dataset, providing valuable results for studying different types of 4D radars. This dataset is released at https://github.com/adept-thu/Dual-Radar.
AI · May 27
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Zhexin Hu, Li Wang, Xiaohan Wang et al.
Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
CL · May 26 · Code
Learning to Adapt SFT Data for Better Reasoning Generalization
Lisong Sun, Li Wang, Chen Zhang et al.
Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.
LG · May 25 · Code
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
Li Wang, Xiaodong Lu, Xiaohan Wang et al.
Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.
NA · Nov 7, 2016
Positivity-preserving and asymptotic preserving method for 2D Keller-Segal equations
Jian-Guo Liu, Li Wang, Zhennan Zhou
We propose a semi-discrete scheme for 2D Keller-Segel equations based on a symmetrization reformation, which is equivalent to the convex splitting method and is free of any nonlinear solver. We show that this new scheme is unconditionally stable as long as the initial condition does not exceed a certain threshold, and that it asymptotically preserves the quasi-static limit in the transient regime. Furthermore, we prove that the fully discrete scheme is conservative and positivity preserving, which makes it ideal for simulations. The analogous schemes for the radially symmetric cases and the subcritical degenerate cases are also presented and analyzed. With extensive numerical tests, we verify the claimed properties of the methods and demonstrate their superiority in various challenging applications.
SP · Jul 10, 2024
Generative AI for RF Sensing in IoT systems
Li Wang, Chao Zhang, Qiyang Zhao et al.
The development of wireless sensing technologies, using signals such as Wi-Fi, infrared, and RF to gather environmental data, has significantly advanced within Internet of Things (IoT) systems. Among these, Radio Frequency (RF) sensing stands out for its cost-effective and non-intrusive monitoring of human activities and environmental changes. However, traditional RF sensing methods face significant challenges, including noise, interference, incomplete data, and high deployment costs, which limit their effectiveness and scalability. This paper investigates the potential of Generative AI (GenAI) to overcome these limitations within the IoT ecosystem. We provide a comprehensive review of state-of-the-art GenAI techniques, focusing on their application to RF sensing problems. By generating high-quality synthetic data, enhancing signal quality, and integrating multi-modal data, GenAI offers robust solutions for RF environment reconstruction, localization, and imaging. Additionally, GenAI's ability to generalize enables IoT devices to adapt to new environments and unseen tasks, improving their efficiency and performance. The main contributions of this article include a detailed analysis of the challenges in RF sensing, the presentation of innovative GenAI-based solutions, and the proposal of a unified framework for diverse RF sensing tasks. Through case studies, we demonstrate the effectiveness of integrating GenAI models, leading to advanced, scalable, and intelligent IoT systems.
CV · Apr 20, 2023
NeUDF: Leaning Neural Unsigned Distance Fields with Volume Rendering
Yu-Tao Liu, Li Wang, Jie yang et al.
Multi-view shape reconstruction has achieved impressive progress thanks to the latest advances in neural implicit surface rendering. However, existing methods based on signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as surface representation. While a naive extension of an SDF-based neural renderer cannot scale to UDF, we propose two new formulations of weight function specially tailored for UDF-based volume rendering. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art method in the task of multi-view surface reconstruction, especially for complex shapes with open boundaries.
CV · May 20 · Code
DrawMotion: Generating 3D Human Motions by Freehand Drawing
Tao Wang, Lei Jin, Zhihua Wu et al.
Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.
CV · Sep 26, 2024 · Code
Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs
Qinpeng Cui, Yixuan Liu, Xinyi Zhang et al.
Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion prior based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: https://github.com/QinpengCui/DoSSR.
SY · Sep 15, 2023
A Stochastic Online Forecast-and-Optimize Framework for Real-Time Energy Dispatch in Virtual Power Plants under Uncertainty
Wei Jiang, Zhongkai Yi, Li Wang et al.
Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular those caused by the fluctuation of renewable energy generation. This issue has driven the need to widely exploit advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dispatch framework, which is composed of two key elements: (i) a hybrid forecast-and-optimize sequential task, integrating deep learning-based forecasting and stochastic optimization, where these two stages are connected by the uncertainty estimation at multiple temporal resolutions; (ii) an efficient online data augmentation scheme, jointly involving model pre-training and online fine-tuning stages. In this way, the proposed framework is capable of rapidly adapting to the real-time data distribution, targeting uncertainties caused by data drift, model discrepancy and environment perturbations in the control process, and finally realizing an optimal and robust dispatch solution. The proposed framework won the championship in the CityLearn Challenge 2022, which provided an influential opportunity to investigate the potential of AI application in the energy domain. In addition, comprehensive experiments are conducted to interpret its effectiveness in the real-life scenario of smart building energy management.
CV · Sep 6, 2022
CAMO-MOT: Combined Appearance-Motion Optimization for 3D Multi-Object Tracking with Camera-LiDAR Fusion
Li Wang, Xinyu Zhang, Wenyuan Qin et al.
3D Multi-object tracking (MOT) ensures consistency during continuous dynamic detection, which is conducive to subsequent motion planning and navigation tasks in autonomous driving. However, camera-based methods suffer in the case of occlusions, and it can be challenging for LiDAR-based methods to accurately track the irregular motion of objects. Some fusion methods work well but do not consider the untrustworthiness of appearance features under occlusion. At the same time, the false detection problem also significantly affects tracking. As such, we propose a novel camera-LiDAR fusion 3D MOT framework based on the Combined Appearance-Motion Optimization (CAMO-MOT), which uses both camera and LiDAR data and significantly reduces tracking failures caused by occlusion and false detection. For occlusion problems, we are the first to propose an occlusion head to select the best object appearance features multiple times effectively, reducing the influence of occlusions. To decrease the impact of false detection in tracking, we design a motion cost matrix based on confidence scores, which improves the positioning and object prediction accuracy in 3D space. As existing multi-object tracking methods only consider a single category, we also propose to build a multi-category loss to implement multi-object tracking in multi-category scenes. A series of validation experiments are conducted on the KITTI and nuScenes tracking benchmarks. Our proposed method achieves state-of-the-art performance and the lowest identity switches (IDS) value (23 for Car and 137 for Pedestrian) among all multi-modal MOT methods on the KITTI test dataset. It also achieves state-of-the-art performance among all algorithms on the nuScenes test dataset with 75.3% AMOTA.
CL · Jan 17, 2023
BERT-ERC: Fine-tuning BERT is Enough for Emotion Recognition in Conversation
Xiangyu Qin, Zhiyu Wu, Jinshi Cui et al.
Previous works on emotion recognition in conversation (ERC) follow a two-step paradigm, which can be summarized as first producing context-independent features via fine-tuning pretrained language models (PLMs) and then analyzing contextual information and dialogue structure information among the extracted features. However, we discover that this paradigm has several limitations. Accordingly, we propose a novel paradigm, i.e., exploring contextual information and dialogue structure information in the fine-tuning step, and adapting the PLM to the ERC task in terms of input text, classification structure, and training strategy. Furthermore, we develop our model BERT-ERC according to the proposed paradigm, which improves ERC performance in three aspects, namely suggestive text, fine-grained classification module, and two-stage training. Compared to existing methods, BERT-ERC achieves substantial improvement on four datasets, indicating its effectiveness and generalization capability. Besides, we also set up the limited resources scenario and the online prediction scenario to approximate real-world scenarios. Extensive experiments demonstrate that the proposed paradigm significantly outperforms the previous one and can be adapted to various scenes.
AP · Feb 2, 2018
Analysis and computation of some tumor growth models with nutrient: from cell density models to free boundary dynamics
Jian-Guo Liu, Min Tang, Li Wang et al.
In this paper, we study the tumor growth equation along with various models for the nutrient component, including the in vitro model and the in vivo model. At the cell density level, the spatial availability of the tumor density $n$ is governed by the Darcy law via the pressure $p(n)=n^γ$. For finite $γ$, we prove some a priori estimates of the tumor growth model, such as boundedness of the nutrient density, and non-negativity and growth estimate of the tumor density. As $γ \rightarrow \infty$, the cell density models formally converge to Hele-Shaw flow models, which determine the free boundary dynamics of the tumor tissue in the incompressible limit. We derive several analytical solutions to the Hele-Shaw flow models, which serve as benchmark solutions to the geometric motion of tumor front propagation. Finally, we apply a conservative and positivity preserving numerical scheme to the cell density models, with numerical results verifying the link between cell density models and the free boundary dynamical models.
CV · Sep 28, 2023
BEVHeight++: Toward Robust Visual Centric 3D Object Detection
Lei Yang, Tao Tang, Jun Li et al.
While most recent autonomous driving systems focus on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach: leveraging intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that state-of-the-art vision-centric bird's eye view detection methods perform poorly on roadside cameras. This is because these methods mainly focus on recovering the depth with respect to the camera center, where the depth difference between the car and the ground quickly shrinks as the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight++, to address this issue. In essence, we regress the height to the ground to achieve a distance-agnostic formulation that eases the optimization process of camera-only perception methods. By incorporating both height and depth encoding techniques, we achieve a more accurate and robust projection from 2D to BEV spaces. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. In the ego-vehicle scenario, BEVHeight++ is superior to depth-only methods. Specifically, it yields a notable improvement of +1.9% NDS and +1.1% mAP over BEVDepth when evaluated on the nuScenes validation set. Moreover, on the nuScenes test set, our method achieves substantial advancements, with an increase of +2.8% NDS and +1.7% mAP, respectively.
CV · Jul 12, 2023
Early Autism Diagnosis based on Path Signature and Siamese Unsupervised Feature Compressor
Zhuowen Yin, Xinyao Ding, Xin Zhang et al.
Autism Spectrum Disorder (ASD) has been emerging as a growing public health threat. Early diagnosis of ASD is crucial for timely, effective intervention and treatment. However, conventional diagnosis methods based on communications and behavioral patterns are unreliable for children younger than 2 years of age. Given evidence of neurodevelopmental abnormalities in ASD infants, we resort to a novel deep learning-based method to extract key features from the inherently scarce, class-imbalanced, and heterogeneous structural MR images for early autism diagnosis. Specifically, we propose a Siamese verification framework to extend the scarce data, and an unsupervised compressor to alleviate data imbalance by extracting key features. We also propose weight constraints to cope with sample heterogeneity by giving different samples different voting weights during validation, and we use Path Signature to unravel meaningful developmental features from the two-time point data longitudinally. We further extract machine-learning-focused brain regions for autism diagnosis. Extensive experiments show that our method performs well under practical scenarios, surpassing existing machine learning methods and providing anatomical insights for early autism diagnosis.
IVOct 20, 2023Code
Progressive Dual Priori Network for Generalized Breast Tumor SegmentationLi Wang, Lihui Wang, Zixiang Kuai et al.
To promote the generalization ability of breast tumor segmentation models, as well as to improve the segmentation performance for breast tumors with small size, low contrast, and irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic contrast-enhanced magnetic resonance images (DCE-MRI) acquired at different centers. The PDPNet first crops tumor regions with a coarse-segmentation-based localization module; the breast tumor mask is then progressively refined using the weak semantic prior and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results showed that, compared with the second-best method, the DSC and HD95 of PDPNet were improved by at least 5.13% and 7.58%, respectively, on multi-center test sets. In addition, through ablations, we demonstrated that the proposed localization module can decrease the influence of normal tissues and therefore improve the generalization ability of the model. The weak semantic priors allow the model to focus on tumor regions and avoid missing small and low-contrast tumors. The cross-scale correlation priors help promote shape awareness for irregular tumors. Thus, integrating them in a unified framework improved the multi-center breast tumor segmentation performance. The source code and open data can be accessed at https://github.com/wangli100209/PDPNet.
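The DSC reported above is the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The following is a minimal sketch on flattened binary masks; the function name `dice_coefficient`, the toy masks, and the convention for two empty masks are assumptions for illustration, not taken from the PDPNet code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks (flat lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical toy masks: 2 overlapping voxels, 3 positives in each mask.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
print(dice_coefficient(pred, target))
```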
IVJun 20, 2023
Brain Anatomy Prior Modeling to Forecast Clinical Progression of Cognitive Impairment with Structural MRILintao Zhang, Jinjian Wu, Lihong Wang et al.
Brain structural MRI has been widely used to assess the future progression of cognitive impairment (CI). Previous learning-based studies usually suffer from small labeled training sets, even though a huge number of structural MRIs exist in large-scale public databases. Intuitively, brain anatomical structures derived from these public MRIs (even without task-specific label information) can be used to boost CI progression trajectory prediction. However, previous studies seldom take advantage of such brain anatomy priors. To this end, this paper proposes a brain anatomy prior modeling (BAPM) framework to forecast the clinical progression of cognitive impairment with small-sized target MRIs by exploring anatomical brain structures. Specifically, the BAPM consists of a pretext model and a downstream model, with a shared brain anatomy-guided encoder to model the brain anatomy prior explicitly. Besides the encoder, the pretext model also contains two decoders for two auxiliary tasks (i.e., MRI reconstruction and brain tissue segmentation), while the downstream model relies on a predictor for classification. The brain anatomy-guided encoder is pre-trained with the pretext model on 9,344 auxiliary MRIs without diagnostic labels for anatomy prior modeling. With this encoder frozen, the downstream model is then fine-tuned on limited target MRIs for prediction. We validate the BAPM on two CI-related studies with T1-weighted MRIs from 448 subjects. Experimental results suggest the effectiveness of BAPM in (1) four CI progression prediction tasks, (2) MR image reconstruction, and (3) brain tissue segmentation, compared with several state-of-the-art methods.
CVSep 30, 2023
MonoGAE: Roadside Monocular 3D Object Detection with Ground-Aware EmbeddingsLei Yang, Jiaxin Yu, Xinyu Zhang et al.
Although the majority of recent autonomous driving systems concentrate on developing perception methods based on ego-vehicle sensors, there is an overlooked alternative approach that involves leveraging intelligent roadside cameras to help extend the ego-vehicle perception ability beyond the visual range. We discover that most existing monocular 3D object detectors rely on the ego-vehicle prior assumption that the optical axis of the camera is parallel to the ground. However, the roadside camera is installed on a pole with a pitched angle, which makes the existing methods not optimal for roadside scenes. In this paper, we introduce a novel framework for Roadside Monocular 3D object detection with ground-aware embeddings, named MonoGAE. Specifically, the ground plane is a stable and strong prior knowledge due to the fixed installation of cameras in roadside scenarios. In order to reduce the domain gap between the ground geometry information and high-dimensional image features, we employ a supervised training paradigm with a ground plane to predict high-dimensional ground-aware embeddings. These embeddings are subsequently integrated with image features through cross-attention mechanisms. Furthermore, to improve the detector's robustness to the divergences in cameras' installation poses, we replace the ground plane depth map with a novel pixel-level refined ground plane equation map. Our approach demonstrates a substantial performance advantage over all previous monocular 3D object detectors on widely recognized 3D detection benchmarks for roadside cameras. The code and pre-trained models will be released soon.
CLMay 18Code
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical ReasoningLi Wang, Xiaohan Wang, Xiaodong Lu et al.
Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. In extensive experiments, IH-GRPO achieves absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.
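For readers unfamiliar with the GRPO family that IH-GRPO builds on, the sketch below shows the group-relative advantage commonly used in standard GRPO: each sampled response's reward is normalized by the mean and standard deviation of its own group. This is a hedged illustration of the base technique only, not of the hierarchical surrogate loss in the abstract; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, implementations differ
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself, no separate value network is needed, which is the usual motivation given for this normalization.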
LGDec 9, 2022
Transfer Learning Enhanced DeepONet for Long-Time Prediction of Evolution EquationsWuzhe Xu, Yulong Lu, Li Wang
Deep operator network (DeepONet) has demonstrated great success in various learning tasks, including learning solution operators of partial differential equations. In particular, it provides an efficient approach to predicting evolution equations over a finite time horizon. Nevertheless, the vanilla DeepONet suffers from stability degradation in long-time prediction. This paper proposes a transfer-learning-aided DeepONet to enhance stability. Our idea is to use transfer learning to sequentially update the DeepONets as the surrogates for propagators learned in different time frames. The evolving DeepONets can better track the varying complexities of the evolution equations, while only needing to be updated by efficient training of a tiny fraction of the operator networks. Through systematic experiments, we show that the proposed method not only improves the long-time accuracy of DeepONet while maintaining similar computational cost but also substantially reduces the sample size of the training set.
CVJul 27, 2022
Convolutional Embedding Makes Hierarchical Vision Transformer StrongerCong Wang, Hongmin Xu, Xiong Zhang et al.
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training data efficiency and inferior local semantic representation capability without appropriate inductive bias. Convolutional neural networks (CNNs) inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the architecture of ViTs to provide desirable inductive bias. However, is the locality achieved by the micro-level CNNs embedded in ViTs good enough? In this paper, we investigate the problem by exploring in depth how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs. In particular, we study the role of token embedding layers, also known as convolutional embedding (CE), and systematically reveal how CE injects desirable inductive bias into ViTs. Besides, we apply the optimal CE configuration to 4 recently released state-of-the-art ViTs, effectively boosting their performance. Finally, a family of efficient hybrid CNNs/ViTs, dubbed CETNets, is released, which may serve as generic vision backbones. Specifically, CETNets achieve 84.9% Top-1 accuracy on ImageNet-1K (training from scratch), 48.6% box mAP on the COCO benchmark, and 51.6% mIoU on ADE20K, substantially improving the performance of the corresponding state-of-the-art baselines.
IVAug 9, 2022
Longitudinal Prediction of Postnatal Brain Magnetic Resonance Images via a Metamorphic Generative Adversarial NetworkYunzhi Huang, Sahar Ahmad, Luyi Han et al.
Missing scans are inevitable in longitudinal studies due to either subject dropouts or failed scans. In this paper, we propose a deep learning framework to predict missing scans from acquired scans, catering to longitudinal infant studies. Prediction of infant brain MRI is challenging owing to the rapid contrast and structural changes, particularly during the first year of life. We introduce a trustworthy metamorphic generative adversarial network (MGAN) for translating infant brain MRI from one time-point to another. MGAN has three key features: (i) image translation leveraging spatial and frequency information for detail-preserving mapping; (ii) a quality-guided learning strategy that focuses attention on challenging regions; and (iii) a multi-scale hybrid loss function that improves translation of tissue contrast and structural details. Experimental results indicate that MGAN outperforms existing GANs by accurately predicting both contrast and anatomical details.
CVAug 6, 2022
Deep Uncalibrated Photometric Stereo via Inter-Intra Image Feature FusionFangzhou Gao, Meng Wang, Lianghao Zhang et al.
Uncalibrated photometric stereo estimates detailed surface normals from images under varying and unknown lighting. Recently, deep learning has brought powerful data priors to this underdetermined problem. This paper presents a new method for deep uncalibrated photometric stereo, which efficiently utilizes the inter-image representation to guide the normal estimation. Previous methods use optimization-based neural inverse rendering or a single size-independent pooling layer to deal with multiple inputs, which is inefficient for utilizing information among input images. Given multiple images under different lighting, we consider the intra-image and inter-image variations to be highly correlated. Motivated by these correlated variations, we design an inter-intra image feature fusion module to introduce the inter-image representation into the per-image feature extraction. The extra representation is used to guide the per-image feature extraction and eliminate the ambiguity in normal estimation. We demonstrate the effect of our design on a wide range of samples, especially on dark materials. Our method produces significantly better results than the state-of-the-art methods on both synthetic and real data.
NAApr 10, 2017
Uniform regularity for linear kinetic equations with random input based on hypocoercivityQin Li, Li Wang
In this paper we study the effect of randomness in kinetic equations that preserve mass. Our focus is on proving the analyticity of the solution with respect to the randomness, which naturally leads to the convergence of numerical methods. The analysis is carried out in a general setting, with the regularity result not depending on the specific form of the collision term, the probability distribution of the random variables, or the regime the system is in, and is therefore termed "uniform". Applications include the linear Boltzmann equation, the BGK model, and the Carleman model, among many others; and the results hold true in kinetic, parabolic and high field regimes. The proof relies on the explicit expression of the high order derivatives of the solution in the random space, and the convergence in time is mainly based on hypocoercivity, which, despite its popularity in PDE analysis of kinetic theory, has rarely been used for numerical algorithms.
CLSep 17, 2024Code
SpMis: An Investigation of Synthetic Spoken Misinformation DetectionPeizhuo Liu, Li Wang, Renqiang He et al.
In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.
NAJan 6, 2011
Uniformly stable rectangular elements for fourth order elliptic singular perturbation problemsLi Wang, Xiaoping Xie
This paper analyzes rectangular finite element methods for fourth order elliptic singular perturbation problems. We show that the non-$C^0$ rectangular Morley element is uniformly convergent in the energy norm with respect to the perturbation parameter. We also propose a $C^0$ extended high order rectangular Morley element and prove its uniform convergence. Finally, we present numerical experiments to confirm the theoretical results.
CVOct 21, 2023
Fuzzy-NMS: Improving 3D Object Detection with Fuzzy Classification in NMSLi Wang, Xinyu Zhang, Fachuan Zhao et al.
Non-maximum suppression (NMS) is an essential post-processing module used in many 3D object detection frameworks to remove overlapping candidate bounding boxes. However, an overreliance on classification scores and difficulties in determining appropriate thresholds can directly affect the resulting accuracy. To address these issues, we introduce fuzzy learning into NMS and propose a novel generalized Fuzzy-NMS module to achieve finer candidate bounding box filtering. The proposed Fuzzy-NMS module combines the volume and clustering density of candidate bounding boxes, refining them with a fuzzy classification method and optimizing the appropriate suppression thresholds to reduce uncertainty in the NMS process. Validation experiments are conducted using the mainstream KITTI and large-scale Waymo 3D object detection benchmarks. The results of these tests demonstrate that the proposed Fuzzy-NMS module can significantly improve the accuracy of numerous recent NMS-based detectors, including PointPillars, PV-RCNN, and IA-SSD. This effect is particularly evident for small objects such as pedestrians and bicycles. As a plug-and-play module, Fuzzy-NMS does not need to be retrained and produces no obvious increase in inference time.
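As background for the abstract above, this is a minimal sketch of the classic greedy NMS baseline that Fuzzy-NMS is described as refining. It uses 2-D axis-aligned boxes and a single fixed IoU threshold for brevity; the box format `(x1, y1, x2, y2)`, the function names, and the threshold value are assumptions, and none of the fuzzy classification or 3-D volume and density terms from the paper are modeled here.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical candidates: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The dependence on one hand-picked threshold and on raw classification scores is visible in the two arguments of `nms`, which are the two weaknesses the abstract names.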
NAFeb 1, 2016
Implicit Asymptotic Preserving Method for Linear Transport EquationsQin Li, Li Wang
The computation of the radiative transfer equation is expensive mainly due to two stiff terms: the transport term and the collision operator. The stiffness in the former comes from the fact that particles (such as photons) travel at the speed of light, while that in the latter is due to the strong scattering in the diffusive regime. We study the fully implicit scheme for this equation to account for the stiffness. The main challenge in the implicit treatment is the coupling between the spatial and velocity coordinates, which makes the to-be-inverted matrix large, ill-conditioned, and not necessarily symmetric. Our main idea is to utilize the spectral structure of the ill-conditioned matrix to construct a pre-conditioner, which, along with a careful split of the spatial and angular dependence, significantly improves the condition number and allows matrix-free treatment. We also design a fast solver to compute this pre-conditioner explicitly in advance. Meanwhile, we reformulate the system via an even-odd parity, which results in a symmetric and positive definite matrix that can be inverted using the conjugate gradient method. This idea can also be applied to the original non-symmetric system, whose inversion is solved by GMRES. A qualitative comparison with conventional methods, including the Krylov iterative method pre-conditioned with diffusive synthetic acceleration and the asymptotic preserving scheme via even-odd decomposition, is also discussed.
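The abstract above relies on the conjugate gradient method for the symmetric positive definite system obtained from the even-odd parity reformulation. As a reminder of what that solver does, here is a minimal unpreconditioned sketch on a small dense matrix. The 2x2 system, the tolerance, and the function names are assumptions for illustration; the paper's pre-conditioner and matrix-free treatment are not reproduced.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    p = list(r)
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New search direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical SPD test system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))
```

The number of iterations needed grows with the condition number of `A`, which is why the abstract puts so much weight on constructing a good pre-conditioner.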
IVMar 8, 2022
Plug-and-play Shape Refinement Framework for Multi-site and Lifespan Brain Skull StrippingYunxiang Li, Ruilong Dan, Shuai Wang et al.
Skull stripping is a crucial prerequisite step in the analysis of brain magnetic resonance images (MRI). Although many excellent works and tools have been proposed, they suffer from low generalization capability. For instance, a model trained on a dataset with specific imaging parameters cannot be applied well to other datasets with different imaging parameters. For lifespan datasets in particular, a model trained on an adult dataset is not applicable to an infant dataset due to the large domain difference. To address this issue, numerous methods have been proposed, of which domain adaptation based on feature alignment is the most common. Unfortunately, this approach has some inherent shortcomings: it needs to be retrained for each new domain and requires concurrent access to the input images of both domains. In this paper, we design a plug-and-play shape refinement (PSR) framework for multi-site and lifespan skull stripping. To deal with the domain shift between multi-site lifespan datasets, we take advantage of the brain shape prior, which is invariant to imaging parameters and ages. Experiments demonstrate that our framework can outperform the state-of-the-art methods on multi-site lifespan datasets.
NAOct 6, 2017
Stability of Stationary Inverse Transport Equation in Diffusion ScalingKe Chen, Qin Li, Li Wang
We consider the inverse problem of reconstructing the optical parameters for the stationary radiative transfer equation (RTE) from velocity-averaged measurements. The RTE often contains multiple scales characterized by the magnitude of a dimensionless parameter, the Knudsen number ($K_n$). In the diffusive scaling ($K_n \ll 1$), the stationary RTE is well approximated by an elliptic equation in the forward setting. However, the inverse problem for the elliptic equation is acknowledged to be severely ill-posed, in contrast to the well-posedness of the inverse transport equation, which raises the question of how uniqueness is lost as $K_n \rightarrow 0$. We tackle this problem by examining the stability of the inverse problem with varying $K_n$. We show that the discrepancy in two measurements is amplified in the reconstructed parameters at the order of $K_n^p$ ($p = 1$ or $2$), and as a result leads to ill-posedness in the zero limit of $K_n$. Our results apply to both continuous and discrete settings. Some numerical tests are performed at the end to validate these theoretical findings.
CLMar 2, 2023
Adopting the Multi-answer Questioning Task with an Auxiliary Metric for Extreme Multi-label Text Classification Utilizing the Label HierarchyLi Wang, Ying Wah Teh, Mohammed Ali Al-Garadi
Extreme multi-label text classification utilizes the label hierarchy to partition extreme labels into multiple label groups, turning the task into simple multi-group multi-label classification tasks. Current research encodes labels as a fixed-length vector, which requires establishing multiple classifiers for different label groups. The problem is how to build only one classifier without sacrificing the label relationships in the hierarchy. This paper adopts the multi-answer questioning task for extreme multi-label classification and proposes an auxiliary classification evaluation metric. This study applies the proposed method and the evaluation metric to the legal domain. The utilization of legal BERTs and the study of task distribution are discussed. The experimental results show that the proposed hierarchy and multi-answer questioning task can perform extreme multi-label classification on the EURLEX dataset. In minor/fine-tuning the multi-label classification task, the domain-adapted BERT models did not show apparent advantages in this experiment. The method is also theoretically applicable to zero-shot learning.
CRMar 24Code
Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMsWenyu Chen, Xiangtao Meng, Chuanchao Zang et al.
Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.
NAApr 21
A Proximal Primal-Dual Approach to Generalized JKO Schemes for Doubly Nonlinear Parabolic EquationsLuis M. Briceño-Arias, José A. Carrillo, Dante Kalise et al.
Variational methods based on optimization strategies are proposed to numerically solve a large family of nonlinear partial differential equations. They are all particular instances of gradient flows with general costs, including the $p$-Laplace equation and flux-limited equations such as the relativistic heat equation. This is achieved by computing explicit formulas for proximal operators with general costs amenable to efficient numerical approximation. We showcase our numerical approach via validation of the results by recovering the qualitative behavior of particular known cases of this large family of steepest descent evolutions.
CVMar 4, 2022
Computer-Aided Road Inspection: Systems and AlgorithmsRui Fan, Sicen Guo, Li Wang et al.
Road damage is an inconvenience and a safety hazard, severely affecting vehicle condition, driving comfort, and traffic safety. The traditional manual visual road inspection process is pricey, dangerous, exhausting, and cumbersome. Also, manual road inspection results are qualitative and subjective, as they depend entirely on the inspector's personal experience. Therefore, there is an ever-increasing need for automated road inspection systems. This chapter first compares the five most common road damage types. Then, 2-D/3-D road imaging systems are discussed. Finally, state-of-the-art machine vision and intelligence-based road damage detection algorithms are introduced.
CRJan 16Code
VidLeaks: Membership Inference Attacks Against Text-to-Video ModelsLi Wang, Wenyu Chen, Ning Yu et al.
The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.
LGAug 12, 2022
Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with Application to Fraud DetectionYingting Liu, Chaochao Chen, Jamie Cui et al.
K-means is one of the most widely used clustering models in practice. Due to the problem of data isolation and the requirement for high model performance, how to jointly build practical and secure K-means for multiple parties has become an important topic for many applications in the industry. Existing work on this is mainly of two types. The first type has efficiency advantages, but information leakage raises potential privacy risks. The second type is provably secure but is inefficient and even impractical for the large-scale data sparsity scenario. In this paper, we propose a new framework for efficient sparsity-aware K-means with three characteristics. First, our framework is divided into a data-independent offline phase and a much faster online phase, and the offline phase allows almost all cryptographic operations to be pre-computed. Second, we take advantage of vectorization techniques in both the online and offline phases. Third, we adopt sparse matrix multiplication for the data sparsity scenario to improve efficiency further. We conduct comprehensive experiments on three synthetic datasets and deploy our model in a real-world fraud detection task. Our experimental results show that, compared with the state-of-the-art solution, our model achieves competitive performance in terms of both running time and communication size, especially on sparse datasets.
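For reference, the plaintext computation that such a privacy-preserving protocol has to reproduce is the ordinary Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its members. The sketch below shows only this single-party, non-private version. The function names, the fixed iteration count, and the toy data are assumptions, and no cryptographic, offline/online, or sparsity-aware component of the paper is modeled.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid unchanged
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data with two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

The distance computation inside `assign` is where products between data and centroids occur, which is presumably where a sparse matrix multiplication of the kind the abstract mentions would matter for sparse inputs.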
CLMar 23Code
TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context CompressionLi Wang, Yandong Wang, Xin Yu et al.
The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.
CVApr 16, 2024Code
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge ReportBin Ren, Yawei Li, Nancy Mehta et al.
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
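The PSNR thresholds quoted above follow the usual definition, PSNR = 10 log10(MAX^2 / MSE), where MSE is the mean squared error between the restored and reference images. A minimal sketch on flattened pixel lists is given below. The function name, the 8-bit peak value of 255, and the toy pixel values are assumptions, and challenge-specific evaluation details such as border cropping or color-space conversion are not represented.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixels: every value is off by 10 gray levels, so MSE = 100.
reference = [100, 100, 100, 100]
estimate = [110, 90, 110, 90]
print(psnr(reference, estimate))
```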
LGMay 1Code
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement LearningZihan Lin, Xiaohan Wang, Jie Cao et al.
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4\% in Avg@16 and 7.0\% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
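The projection step described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `projection_residuals`, the rank `k`, and the use of the relative residual norm as a per-token weight are assumptions made for illustration.

```python
import numpy as np

def projection_residuals(pos_hidden, neg_hidden, k):
    """Relative residual of each negative-token vector outside a rank-k positive subspace.

    pos_hidden: (n_pos, d) hidden states of positive-response tokens.
    neg_hidden: (n_neg, d) hidden states of negative-response tokens.
    Returns an (n_neg,) array in [0, 1]: 0 means the token lies fully inside
    the positive subspace, 1 means it is orthogonal to it.
    """
    # Right singular vectors span the dominant directions of the positive states.
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    basis = vt[:k]                            # (k, d), orthonormal rows
    projected = neg_hidden @ basis.T @ basis  # projection onto the subspace
    residual = neg_hidden - projected
    norms = np.linalg.norm(neg_hidden, axis=1)
    return np.linalg.norm(residual, axis=1) / np.maximum(norms, 1e-12)

# Toy example: positive states live in the plane of the first two axes.
pos = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
neg = np.array([[3.0, 4.0, 0.0],   # inside the positive subspace
                [0.0, 0.0, 5.0]])  # orthogonal to it
weights = projection_residuals(pos, neg, k=2)
```

One plausible reading, stated here as an assumption, is that a negative token with a small residual shares its semantics with positive responses and should be penalized less, while a token with a large residual can be penalized more strongly.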
AIMay 7Code
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent SystemsZhexuan Wang, Xuebo Liu, Li Wang et al.
Large language model (LLM)-based multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
OCJul 9, 2024
Solving General Natural-Language-Description Optimization Problems with Large Language ModelsJihai Zhang, Wei Wang, Siyan Guo et al.
Optimization problems seek to find the best solution to an objective under a set of constraints, and have been widely investigated in real-world applications. Modeling and solving optimization problems in a specific domain typically require a combination of domain knowledge, mathematical skills, and programming ability, making it difficult for general users and even domain professionals. In this paper, we propose a novel framework called OptLLM that augments LLMs with external solvers. Specifically, OptLLM accepts user queries in natural language, converts them into mathematical formulations and program code, and calls the solvers to calculate the results for decision-making. In addition, OptLLM supports multi-round dialogues to gradually refine the modeling and solving of optimization problems. To illustrate the effectiveness of OptLLM, we provide tutorials on three typical optimization applications and conduct experiments on both prompt-based GPT models and a fine-tuned Qwen model using a large-scale self-developed optimization dataset. Experimental results show that OptLLM works with various LLMs, and the fine-tuned model achieves an accuracy boost compared to the prompt-based models. Some features of the OptLLM framework have been available for trial since June 2023 (https://opt.alibabacloud.com/chat or https://opt.aliyun.com/chat).
SDApr 20
VoxSafeBench: Not Just What Is Said, but Who, How, and WhereYuxiang Wang, Hongyu Liu, Yijiang Xu et al.
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
ROApr 14
HazardArena: Evaluating Semantic Safety in Vision-Language-Action ModelsZixing Chen, Yifeng Gao, Li Wang et al.
Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action execution using semantic attributes or a vision-language judge, substantially reducing unsafe behaviors with minimal impact on task performance. We hope that HazardArena highlights the need to rethink how semantic safety is evaluated and enforced in VLAs as they scale toward real-world deployment.
ITAug 26, 2024
Scalable Multivariate Fronthaul Quantization for Cell-Free Massive MIMOSangwoo Park, Ahmet Hasim Gokceoglu, Li Wang et al.
The conventional approach to fronthaul design for cell-free massive MIMO systems follows the compress-and-precode (CP) paradigm. Accordingly, encoded bits and precoding coefficients are shared by the distributed unit (DU) on the fronthaul links, and precoding takes place at the radio units (RUs). Previous theoretical work has shown that CP can potentially be improved by a significant margin by precode-and-compress (PC) methods, in which all baseband processing is carried out at the DU, which compresses the precoded signals for transmission on the fronthaul links. The theoretical performance gains of PC methods are particularly pronounced when the DU implements multivariate quantization (MQ), applying joint quantization across the signals for all the RUs. However, existing solutions for MQ are characterized by a computational complexity that grows exponentially with the sum-fronthaul capacity from the DU to all RUs. This work sets out to design scalable MQ strategies for PC-based cell-free massive MIMO systems. For the low-fronthaul capacity regime, we present alpha-parallel MQ (alpha-PMQ), whose complexity is exponential only in the fronthaul capacity towards an individual RU, while performing close to full MQ. alpha-PMQ tailors MQ to the topology of the network by allowing for parallel local quantization steps for RUs that do not interfere too much with each other. For the high-fronthaul capacity regime, we then introduce neural MQ, which replaces the exhaustive search in MQ with gradient-based updates for a neural-network-based decoder, attaining a complexity that grows linearly with the sum-fronthaul capacity. Numerical results demonstrate that the proposed scalable MQ strategies outperform CP for both the low and high-fronthaul capacity regimes at the cost of increased computational complexity at the DU (but not at the RUs).
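The contrast between per-RU quantization and joint (multivariate) quantization by exhaustive search can be sketched with a toy example. This is a simplified real-valued model written for illustration, not the paper's algorithm: the function names, the one-bit codebook `levels`, the channel matrix `H`, and the squared-error distortion at the users are all assumptions.

```python
from itertools import product

def received(H, x):
    """Signal seen by each user: y = H x (real-valued toy model)."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]

def distortion(H, x, x_hat):
    """Squared error between the ideal and the quantized received signals."""
    return sum((a - b) ** 2 for a, b in zip(received(H, x), received(H, x_hat)))

def independent_quantize(x, levels):
    """Quantize each RU's signal separately to its nearest level."""
    return tuple(min(levels, key=lambda q: abs(q - v)) for v in x)

def joint_quantize(H, x, levels):
    """Exhaustive search over all len(levels) ** len(x) joint codewords."""
    return min(product(levels, repeat=len(x)), key=lambda c: distortion(H, x, c))

levels = [-1.0, 1.0]   # 1 bit per RU (hypothetical codebook)
H = [[1.0, 1.0]]       # one user served by two RUs (hypothetical channel)
x = [0.2, 0.2]         # precoded signals intended for the two RUs

x_ind = independent_quantize(x, levels)
x_joint = joint_quantize(H, x, levels)
```

In this toy case the joint search picks codewords whose quantization errors partly cancel at the user, which per-RU quantization cannot do. The search space has `len(levels) ** len(x)` entries, that is, 2 raised to the total number of fronthaul bits, which mirrors the exponential growth with the sum-fronthaul capacity mentioned in the abstract.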