IV | Sep 7, 2022 | Code
Spach Transformer: Spatial and Channel-wise Transformer Based on Local and Global Self-attentions for PET Image Denoising
Se-In Jang, Tinsu Pan, Ye Li et al.
Positron emission tomography (PET) is widely used in clinics and research due to its quantitative merits and high sensitivity, but suffers from low signal-to-noise ratio (SNR). Recently, convolutional neural networks (CNNs) have been widely used to improve PET image quality. Though successful and efficient in local feature extraction, CNNs cannot capture long-range dependencies well due to their limited receptive field. Global multi-head self-attention (MSA) is a popular approach to capture long-range information. However, the calculation of global MSA for 3D images has high computational costs. In this work, we propose an efficient spatial and channel-wise encoder-decoder transformer, Spach Transformer, that can leverage spatial and channel information based on local and global MSAs. Experiments based on datasets of different PET tracers, i.e., $^{18}$F-FDG, $^{18}$F-ACBC, $^{18}$F-DCFPyL, and $^{68}$Ga-DOTATATE, were conducted to evaluate the proposed framework. Quantitative results show that the proposed Spach Transformer framework outperforms state-of-the-art deep learning architectures. Our code is available at https://github.com/sijang/SpachTransformer.
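To make the cost argument concrete, the sketch below counts query-key pairs for global attention over a 3D volume versus attention restricted to non-overlapping cubic windows. The windowing scheme, the 64x64x64 token grid, and the window size of 8 are illustrative assumptions; the abstract does not say how the local MSA is partitioned.

```python
def attention_pairs_global(d, h, w):
    """Query-key pairs when every voxel token attends to every other token."""
    n = d * h * w
    return n * n


def attention_pairs_windowed(d, h, w, m):
    """Query-key pairs when attention is limited to non-overlapping m*m*m windows.

    Assumes each dimension is divisible by the window size m (hypothetical setup).
    """
    if d % m or h % m or w % m:
        raise ValueError("each dimension must be divisible by the window size")
    n = d * h * w
    # Each of the n tokens attends only to the m**3 tokens in its own window.
    return n * m ** 3


if __name__ == "__main__":
    global_pairs = attention_pairs_global(64, 64, 64)
    local_pairs = attention_pairs_windowed(64, 64, 64, 8)
    print(global_pairs // local_pairs)  # reduction factor for this toy setting
```

The ratio equals the number of windows, so the saving from local attention grows with volume size at a fixed window size.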
RO | May 26
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation
Boxiang Qiu, Liliang Chen, Yue Liao et al.
We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4x frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.
RO | May 28
ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models
Ye Li, Huanan Liu, Kangye Ji et al.
Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
CV | Jan 4, 2023 | Code
Underwater Object Tracker: UOSTrack for Marine Organism Grasping of Underwater Vehicles
Yunfeng Li, Bo Wang, Ye Li et al.
A visual single-object tracker is an indispensable component of underwater vehicles (UVs) in marine organism grasping tasks. Its accuracy and stability are imperative to guide the UVs to perform grasping behavior. Although single-object trackers show competitive performance in the challenge of underwater image degradation, there are still issues with sample imbalance and exclusion of similar objects that need to be addressed for application in marine organism grasping. This paper proposes Underwater OSTrack (UOSTrack), which consists of underwater image and open-air sequence hybrid training (UOHT) and motion-based post-processing (MBPP). The UOHT training paradigm is designed to train the sample-imbalanced underwater tracker so that the tracker is exposed to a great number of underwater domain training samples and learns the feature expressions. The MBPP paradigm is proposed to exclude similar objects. It uses the estimation box predicted with a Kalman filter and the candidate boxes in the response map to relocate the lost tracked object in the candidate area. Compared to state-of-the-art methods on various benchmarks, UOSTrack achieves an average performance improvement of 4.41% and a maximum improvement of 7.98%. Field experiments have verified the accuracy and stability of our proposed UOSTrack for UVs in marine organism grasping tasks. More details can be found at https://github.com/LiYunfengLYF/UOSTrack.
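The motion-based relocation idea can be sketched as follows. This is not the UOSTrack implementation: the constant-velocity motion model, the noise settings `q` and `r`, the IoU-based matching rule, and the `min_iou` threshold are all assumptions made for illustration. A per-axis Kalman filter predicts where the box should be, and the candidate box that best overlaps the prediction is selected.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate (state: position, velocity)."""

    def __init__(self, pos, q=1e-2, r=1.0):
        self.p, self.v = float(pos), 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # process / measurement noise (assumed values)

    def predict(self):
        (a, b), (c, d) = self.P
        self.p += self.v                   # x' = F x with F = [[1, 1], [0, 1]]
        self.P = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]  # F P F^T + q I
        return self.p

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                     # innovation variance
        k0, k1 = a / s, c / s              # Kalman gain
        y = z - self.p                     # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def relocate(estimate, candidates, min_iou=0.3):
    """Pick the candidate closest to the motion estimate, or None if none overlaps enough."""
    best = max(candidates, key=lambda c: iou(estimate, c), default=None)
    if best is None or iou(estimate, best) < min_iou:
        return None
    return best
```

A visually similar distractor far from the predicted position gets a low IoU and is rejected, which is the intuition behind using motion cues to exclude similar objects.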
CV | Sep 9, 2023 | Code
UnitModule: A Lightweight Joint Image Enhancement Module for Underwater Object Detection
Zhuoyan Liu, Bo Wang, Ye Li et al.
Underwater object detection faces the problem of underwater image degradation, which affects the performance of the detector. Underwater object detection methods based on noise reduction and image enhancement usually do not provide images preferred by the detector or require additional datasets. In this paper, we propose a plug-and-play Underwater joint image enhancement Module (UnitModule) that provides the input image preferred by the detector. We design an unsupervised learning loss for the joint training of UnitModule with the detector without additional datasets to improve the interaction between UnitModule and the detector. Furthermore, a color cast predictor with an assisting color cast loss and a data augmentation called Underwater Color Random Transfer (UCRT) are designed to improve the performance of UnitModule on underwater images with different color casts. Extensive experiments are conducted on DUO for different object detection models, where UnitModule achieves the highest performance improvement of 2.6 AP for YOLOv5-S and an improvement of 3.3 AP on the brand-new test set ($\text{URPC}_{test}$). UnitModule also significantly improves the performance of all object detection models we test, especially models with a small number of parameters. In addition, with only 31K parameters, UnitModule has little effect on the inference speed of the original object detection model. Our quantitative and visual analysis also demonstrates the effectiveness of UnitModule in enhancing the input image and improving the perception ability of the detector for object features. The code is available at https://github.com/LEFTeyex/UnitModule.
CV | Oct 9, 2023 | Code
Lightweight Full-Convolutional Siamese Tracker
Yunfeng Li, Bo Wang, Xueyi Wu et al.
Although single object trackers have achieved advanced performance, their large-scale models hinder their application on resource-limited platforms. Moreover, existing lightweight trackers balance only two or three of the following four factors: parameters, performance, FLOPs, and FPS. To achieve the optimal balance among all four, this paper proposes a lightweight full-convolutional Siamese tracker called LightFC. LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to improve the feature representation of the convolutional tracking pipeline. The ECM uses an attention-like module design, which conducts spatial and channel linear fusion of fused features and enhances the nonlinearity of the fused features. Additionally, it draws on the success factors of current lightweight trackers and introduces skip-connections and reuse of search area features. The ERH reparameterizes the feature dimensional stage in the standard center-head and introduces channel attention to optimize the bottleneck of key feature flows. Comprehensive experiments show that LightFC achieves the optimal balance between performance, parameters, FLOPs, and FPS. The precision score of LightFC outperforms MixFormerV2-S on LaSOT and TNL2K by 3.7% and 6.5%, respectively, while using 5x fewer parameters and 4.6x fewer FLOPs. Besides, LightFC runs 2x faster than MixFormerV2-S on CPUs. In addition, a higher-performance version named LightFC-vit is proposed by substituting a more powerful backbone network. The code and raw results can be found at https://github.com/LiYunfengLYF/LightFC.
CV | Jul 6, 2024 | Code
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Ye Li, Chen Tang, Yuan Meng et al.
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. First, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, making supervised learning infeasible. To this end, we design a lightweight selector employing Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a Markov decision process, significantly reducing action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE in reducing FLOPs by approximately 50%, retaining only about 10% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at https://github.com/ChildTang/PRANCE.
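As a rough illustration of the token-pruning step alone, the sketch below keeps the highest-scoring fraction of tokens for a given keep ratio. It does not reproduce PRANCE's PPO-based selector or its meta-network; the function name `prune_tokens`, the importance scores, and the idea that a keep ratio arrives from some upstream decision are hypothetical stand-ins.

```python
import math


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order.

    tokens: list of token representations (any objects)
    scores: one importance score per token (how these are computed is assumed)
    keep_ratio: fraction of tokens to retain, in (0, 1]
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore sequence order
    return [tokens[i] for i in keep], keep
```

Because self-attention accepts variable-length token sequences, the pruned list can be passed to later layers without changing the model weights.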
IV | Mar 15, 2022
A Noise-level-aware Framework for PET Image Denoising
Ye Li, Jianan Cui, Junyu Chen et al.
In PET, the amount of relative (signal-dependent) noise present in different body regions can be significantly different and is inherently related to the number of counts present in that region. The number of counts in a region depends, in principle and among other factors, on the total administered activity, scanner sensitivity, image acquisition duration, radiopharmaceutical tracer uptake in the region, and patient local body morphometry surrounding the region. In theory, fewer denoising operations are needed to denoise a high-count (low relative noise) image than a low-count (high relative noise) image, and vice versa. The current deep-learning-based methods for PET image denoising are predominantly trained on image appearance only and have no special treatment for images of different noise levels. Our hypothesis is that by explicitly providing the local relative noise level of the input image to a deep convolutional neural network (DCNN), the DCNN can outperform the same network trained on image appearance only. To this end, we propose a noise-level-aware denoising framework that allows embedding of local noise level into a DCNN. The proposed framework is trained and tested on 30 and 15 patient PET images, respectively, acquired on a GE Discovery MI PET/CT system. Our experiments showed that the increases in both PSNR and SSIM from our backbone network with relative noise level embedding (NLE) versus the same network without NLE were statistically significant with p<0.001, and the proposed method outperformed a strong baseline method by a large margin.
SP | Jul 3, 2024
Generative AI Enables EEG Super-Resolution via Spatio-Temporal Adaptive Diffusion Learning
Shuqiang Wang, Tong Zhou, Yanyan Shen et al.
Electroencephalogram (EEG) technology, particularly high-density EEG (HD EEG) devices, is widely used in fields such as neuroscience. HD EEG devices improve the spatial resolution of EEG by placing more electrodes on the scalp, which meets the requirements of clinical diagnostic applications such as epilepsy focus localization. However, this technique faces challenges, such as high acquisition costs and limited usage scenarios. In this paper, spatio-temporal adaptive diffusion models (STAD) are proposed to pioneer the use of diffusion models for achieving spatial super-resolution (SR) reconstruction from low-resolution (LR, 64 channels or fewer) EEG to high-resolution (HR, 256 channels) EEG. Specifically, a spatio-temporal condition module is designed to extract the spatio-temporal features of LR EEG, which are then used as conditional inputs to direct the reverse denoising process. Additionally, a multi-scale Transformer denoising module is constructed to leverage multi-scale convolution blocks and cross-attention-based diffusion Transformer blocks for conditional guidance to generate subject-adaptive SR EEG. Experimental results demonstrate that STAD significantly enhances the spatial resolution of LR EEG and quantitatively outperforms existing methods. Furthermore, the STAD models demonstrate their value when the synthetic SR EEG is applied to classification and source localization tasks, indicating their potential to substantially boost the spatial resolution of EEG.
CV | Aug 11, 2024 | Code
U-DECN: End-to-End Underwater Object Detection ConvNet with Improved DeNoising Training
Zhuoyan Liu, Bo Wang, Bing Wang et al.
Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on unmanned underwater vehicles. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), a query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Deformable Convolution in SIM and Separate Contrastive DeNoising Forward methods. To address the underwater color cast noise issue, we propose an Underwater Color DeNoising Query method to improve the generalization of the model to object feature information biased by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves the best 64.0 AP on DUO and the best 58.1 AP on RUOD, and 21 FPS (5 times faster than the 4 FPS of Deformable DETR and DINO) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.
LG | Nov 15, 2022
Contextual Transformer for Offline Meta Reinforcement Learning
Runji Lin, Ye Li, Xidong Feng et al.
The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and of efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can improve sequence modeling-based offline reinforcement learning (offline-RL) algorithms. First, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional policy generation. As such, we can pretrain a model on the offline dataset with self-supervised loss and learn a prompt to guide the policy towards desired actions. Second, we extend our framework to Meta-RL settings and propose Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. Superior results validate the strong performance and generality of our methods.
LG | May 5, 2022
LPC-AD: Fast and Accurate Multivariate Time Series Anomaly Detection via Latent Predictive Coding
Zhi Qi, Hong Xie, Ye Li et al.
This paper proposes LPC-AD, a fast and accurate multivariate time series (MTS) anomaly detection method. LPC-AD is motivated by the ever-increasing need for fast and accurate MTS anomaly detection methods to support fast troubleshooting in cloud computing, micro-service systems, etc. LPC-AD is fast in the sense that it reduces the training time by as much as 38.2% compared to the state-of-the-art (SOTA) deep learning methods that focus on training speed. LPC-AD is accurate in the sense that it improves the detection accuracy by as much as 18.9% compared to SOTA sophisticated deep learning methods that focus on enhancing detection accuracy. Methodologically, LPC-AD contributes a generic architecture, LPC-Reconstruct, with which one can attain different trade-offs between training speed and detection accuracy. More specifically, LPC-Reconstruct is built on ideas from autoencoders for reducing redundancy in time series, latent predictive coding for capturing temporal dependence in MTS, and randomized perturbation for avoiding overfitting of anomalous dependence in the training data. We present simple instantiations of LPC-Reconstruct to attain fast training speed, where we propose a simple randomized perturbation method. The superior performance of LPC-AD over SOTA methods is validated by extensive experiments on four large real-world datasets. Experimental results also show the necessity and benefit of each component of the LPC-Reconstruct architecture and that LPC-AD is robust to hyperparameters.
AI | Jan 28, 2023
Interactive Log Parsing via Light-weight User Feedback
Liming Wang, Hong Xie, Ye Li et al.
Template mining is one of the foundational tasks in log analysis, which supports the diagnosis and troubleshooting of large-scale Web applications. This paper develops a human-in-the-loop template mining framework to support interactive log analysis, which is highly desirable in real-world diagnosis or troubleshooting of Web applications, yet previous template mining algorithms fail to support it. We formulate three types of light-weight user feedback and, based on them, design three atomic human-in-the-loop template mining algorithms. We derive mild conditions under which the outputs of our proposed algorithms are provably correct. We also derive upper bounds on the computational complexity and query complexity of each algorithm. We demonstrate the versatility of our proposed algorithms by combining them to improve the template mining accuracy of five representative algorithms over sixteen widely used benchmark datasets.
CV | May 26
ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
Akide Liu, Jinbo Xing, Chaojie Mao et al.
Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.
IV | Dec 21, 2022
Investigation of Network Architecture for Multimodal Head-and-Neck Tumor Segmentation
Ye Li, Junyu Chen, Se-in Jang et al.
Inspired by the recent success of Transformers for Natural Language Processing and Vision Transformers for Computer Vision, many researchers in the medical imaging community have flocked to Transformer-based networks for various mainstream medical tasks such as classification, segmentation, and estimation. In this study, we analyze two recently published Transformer-based network architectures for the task of multimodal head-and-neck tumor segmentation and compare their performance to the de facto standard 3D segmentation network, the nnU-Net. Our results showed that modeling long-range dependencies may be helpful in cases where large structures are present and/or a large field of view is needed. However, for small structures such as head-and-neck tumors, the convolution-based U-Net architecture seemed to perform well, especially when the training dataset is small and computational resources are limited.
CV | May 13 | Code
Test-time Sparsity for Extreme Fast Action Diffusion
Kangye Ji, Yuan Meng, Jianbo Zhou et al.
Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.
QUANT-PH | Jan 9, 2023
VQNet 2.0: A New Generation Machine Learning Framework that Unifies Classical and Quantum
Huanyu Bian, Zhilong Jia, Menghan Dou et al.
With the rapid development of classical and quantum machine learning, a large number of machine learning frameworks have been proposed. However, existing machine learning frameworks usually focus on either classical or quantum machine learning, rather than both. Therefore, based on VQNet 1.0, we further propose VQNet 2.0, a new generation of unified classical and quantum machine learning framework that supports hybrid optimization. The core library of the framework is implemented in C++, the user level is implemented in Python, and the framework supports deployment on quantum and classical hardware. In this article, we analyze the development trend of the new generation of machine learning frameworks and introduce the design principles of VQNet 2.0 in detail: unity, practicality, efficiency, and compatibility, as well as full particulars of implementation. We illustrate the functions of VQNet 2.0 through several basic applications, including classical convolutional neural networks, quantum autoencoders, hybrid classical-quantum networks, etc. After that, through extensive experiments, we demonstrate that the operation speed of VQNet 2.0 is higher than that of the comparison method. Finally, we demonstrate that VQNet 2.0 can be deployed on different hardware platforms and that its overall calculation speed is faster than that of the comparison method. It can also be mixed and optimized with quantum circuits composed of multiple quantum computing libraries.
LG | Apr 4, 2023
OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting
Xiao He, Ye Li, Jian Tan et al.
Seasonal-trend decomposition is one of the most fundamental concepts in time series analysis that supports various downstream tasks, including time series anomaly detection and forecasting. However, existing decomposition methods rely on batch processing with a time complexity of O(W), where W is the number of data points within a time window. Therefore, they cannot always efficiently support real-time analysis that demands low processing delay. To address this challenge, we propose OneShotSTL, an efficient and accurate algorithm that can decompose time series online with an update time complexity of O(1). OneShotSTL is more than 1,000 times faster than the batch methods, with accuracy comparable to the best counterparts. Extensive experiments on real-world benchmark datasets for downstream time series anomaly detection and forecasting tasks demonstrate that OneShotSTL is from 10 to over 1,000 times faster than the state-of-the-art methods, while still providing comparable or even better accuracy.
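To show what an O(1) per-point update looks like in contrast to re-processing a window of W points, here is a minimal online decomposition based on exponential smoothing with a seasonal buffer. This is a swapped-in, much simpler technique and not the OneShotSTL algorithm; the class name and the smoothing factors `alpha` and `gamma` are hypothetical.

```python
class OnlineSeasonalTrend:
    """Toy online seasonal-trend decomposition with constant work per new point.

    Uses exponential smoothing (not OneShotSTL). Each update reads and writes a
    fixed number of values regardless of how much history has been seen.
    """

    def __init__(self, period, alpha=0.2, gamma=0.2):
        self.period = period
        self.alpha = alpha              # trend smoothing factor (assumed value)
        self.gamma = gamma              # seasonal smoothing factor (assumed value)
        self.trend = None
        self.seasonal = [0.0] * period  # one seasonal estimate per phase
        self.t = 0

    def update(self, y):
        i = self.t % self.period
        if self.trend is None:
            self.trend = float(y)
        else:
            # Deseasonalize the new point, then smooth it into the trend.
            self.trend = self.alpha * (y - self.seasonal[i]) + (1 - self.alpha) * self.trend
        # Detrend the new point, then smooth it into this phase's seasonal term.
        self.seasonal[i] = self.gamma * (y - self.trend) + (1 - self.gamma) * self.seasonal[i]
        residual = y - self.trend - self.seasonal[i]
        self.t += 1
        return self.trend, self.seasonal[i], residual
```

A large residual on a new point can then be flagged as a candidate anomaly by a downstream threshold, which is one way such a decomposition feeds anomaly detection.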
DB | Apr 13
Ozone: A Unified Platform for Transportation Research
Ou Zheng, Ruyi Feng, Yufeng Yang et al.
Intelligent Transportation Systems increasingly depend on heterogeneous data from roadside cameras, UAV imagery, LiDAR, and in-vehicle sensors, yet the lack of unified data standards, model interfaces, and evaluation protocols across these sources hampers reproducibility, cross-dataset benchmarking, and cross-region transferability of research findings. Existing trajectory datasets follow incompatible conventions for coordinate systems, object representations, and metadata fields, forcing researchers to build custom preprocessing pipelines for each dataset and simulator combination. To address these challenges, we propose Ozone, a unified platform for transportation research organized around five interconnected layers -- Hardware, Data, Model, Evaluation, and Prototype -- each with standardized schemas, automated conversion pipelines, and interoperable interfaces. In the first release, the data schema unifies four trajectory datasets -- NGSIM, highD, CitySim, and UTE -- into a canonical format with oriented bounding boxes, kinematic variables, and pre-computed surrogate safety measures. Digital-twin maps in CARLA and calibrated traffic models provide integrated benchmarking environments. Case studies in human-factor research, traffic scene generation, and safety-critical modeling demonstrate that Ozone reduces experiment setup time by 85%, achieves 91% cross-city transfer efficiency for safety models, and improves cross-dataset reproducibility to within 3% variance. The source code and datasets are publicly available.
CV | Mar 14, 2023
DAA: A Delta Age AdaIN operation for age estimation via binary code transformer
Ping Chen, Xingpeng Zhang, Ye Li et al.
Naked eye recognition of age is usually based on comparison with the age of others. However, this idea is ignored by computer tasks because it is difficult to obtain representative contrast images of each age. Inspired by transfer learning, we designed the Delta Age AdaIN (DAA) operation to obtain the feature difference with each age, which obtains the style map of each age through the learned values representing the mean and standard deviation. We use the binary code of each age, a natural number, as the input of transfer learning to obtain continuous age feature information. The two groups of values learned in Binary code mapping correspond to the mean and standard deviation of the comparison ages. In summary, our method consists of four parts: FaceEncoder, DAA operation, Binary code mapping, and AgeDecoder modules. After getting the delta age via AgeDecoder, we take the average value of all comparison ages and delta ages as the predicted age. Compared with state-of-the-art methods, our method achieves better performance with fewer parameters on multiple facial age datasets.
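Two of the steps named in the abstract, encoding an age as a binary code and averaging comparison ages with their delta ages, can be sketched as below. The 7-bit width, the function names, and the reading that each comparison age plus its delta gives one estimate of the input age are assumptions; the learned modules (FaceEncoder, DAA, AgeDecoder) are not modeled here.

```python
def age_to_binary(age, bits=7):
    """Encode a natural-number age as a fixed-width list of bits, most significant first.

    The 7-bit width (ages 0..127) is an illustrative assumption.
    """
    if not 0 <= age < 2 ** bits:
        raise ValueError("age out of range for the chosen bit width")
    return [(age >> i) & 1 for i in reversed(range(bits))]


def predict_age(comparison_ages, delta_ages):
    """Average the per-comparison estimates (comparison age + predicted delta age)."""
    if len(comparison_ages) != len(delta_ages) or not comparison_ages:
        raise ValueError("need one delta age per comparison age")
    estimates = [c + d for c, d in zip(comparison_ages, delta_ages)]
    return sum(estimates) / len(estimates)
```

For example, with comparison ages 20, 30, and 40 and hypothetical predicted deltas of 6, -5, and -14, the three estimates are 26, 25, and 26, and their mean is the predicted age.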
IVNov 13, 2023
TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival PredictionRuiquan Ge, Xiangyang Hu, Rungen Huang et al.
Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer and improve survival prediction. Deep learning-based approaches have seen increasing success in survival prediction by integrating pathological images and gene expression data. However, most existing approaches overlook the intra-modality latent information and the complex inter-modality correlations. Furthermore, existing approaches do not fully exploit the representational capabilities of neural networks for feature aggregation and disregard the importance of relationships between features. Addressing these issues is therefore important for enhancing prediction performance. We propose a novel framework named Two-stream Transformer-based Multimodal Fusion Network for survival prediction (TTMFN), which integrates pathological images and gene expression data. In TTMFN, we present a two-stream multimodal co-attention transformer module to take full advantage of the complex relationships between different modalities and the potential connections within the modalities. Additionally, we develop a multi-head attention pooling approach to effectively aggregate the feature representations of the two modalities. Experimental results on four datasets from The Cancer Genome Atlas demonstrate that TTMFN achieves the best performance or competitive results compared to state-of-the-art methods in predicting the overall survival of patients.
AIDec 13, 2022
Generative artificial intelligence-enabled dynamic detection of nicotine-related circuitsChangwei Gong, Changhong Jing, Ye Li et al.
The identification of addiction-related circuits is critical for explaining addiction processes and developing addiction treatments. Models of functional addiction circuits developed from functional imaging are an effective tool for discovering and verifying addiction circuits. However, analyzing functional imaging data of addiction and detecting functional addiction circuits remain challenging. We have developed a data-driven, end-to-end generative artificial intelligence (AI) framework to address these difficulties. The framework integrates dynamic brain network modeling and a novel network architecture, including temporal graph Transformer and contrastive learning modules. Our generative AI framework forms a complete workflow, from neurobiological experiments and computational modeling to end-to-end neural networks, in which functional imaging data is transformed into dynamic nicotine addiction-related circuits. It enables the detection of addiction-related brain circuits with dynamic properties and reveals the underlying mechanisms of addiction.
LGMar 3, 2023
Implicit Stochastic Gradient Descent for Training Physics-informed Neural NetworksYe Li, Song-Can Chen, Sheng-Jun Huang
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems, but they still suffer from training failures when the target functions to be approximated exhibit high-frequency or multi-scale features. In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs to improve the stability of the training process. We heuristically analyze how ISGD overcomes stiffness in the gradient flow dynamics of PINNs, especially for problems with multi-scale solutions. We theoretically prove that for two-layer fully connected neural networks with a large number of hidden nodes, randomly initialized ISGD converges to a globally optimal solution for the quadratic loss function. Empirical results demonstrate that ISGD works well in practice and compares favorably to other gradient-based optimization methods such as SGD and Adam, while also effectively addressing the numerical stiffness in training dynamics via gradient descent.
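To illustrate why an implicit update helps with stiffness, here is a minimal sketch on a toy quadratic loss rather than a PINN. It is a full-batch (deterministic) simplification, and the matrix `A`, vector `b`, and learning rate are illustrative assumptions. The implicit step evaluates the gradient at the new iterate, which for a quadratic loss reduces to a linear solve.

```python
import numpy as np

def explicit_step(theta, A, b, lr):
    """Ordinary gradient step on L(theta) = 0.5 * theta^T A theta - b^T theta."""
    return theta - lr * (A @ theta - b)

def implicit_step(theta, A, b, lr):
    """Implicit step: theta_new = theta - lr * grad L(theta_new).

    For the quadratic loss this is (I + lr * A) theta_new = theta + lr * b.
    """
    n = len(theta)
    return np.linalg.solve(np.eye(n) + lr * A, theta + lr * b)

# Stiff toy problem: curvatures differ by a factor of 1000 (assumed values)
A = np.diag([1.0, 1000.0])
b = np.array([1.0, 1000.0])
theta_star = np.linalg.solve(A, b)  # exact minimizer

theta_exp = np.zeros(2)
theta_imp = np.zeros(2)
for _ in range(50):
    theta_exp = explicit_step(theta_exp, A, b, lr=0.1)
    theta_imp = implicit_step(theta_imp, A, b, lr=0.1)

err_explicit = np.linalg.norm(theta_exp - theta_star)  # blows up
err_implicit = np.linalg.norm(theta_imp - theta_star)  # shrinks
```

With this learning rate the explicit iteration is unstable along the stiff direction because |1 - 0.1 * 1000| > 1, while the implicit iteration contracts in every direction. For a neural network the implicit equation has no closed form and would have to be solved approximately at each step.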
CVSep 8, 2024
Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in ConversationsXinran Li, Xiaomao Fan, Qingyang Wu et al.
Emotion Recognition in Conversations (ERCs) is a vital area within multimodal interaction research, dedicated to accurately identifying and classifying the emotions expressed by speakers throughout a conversation. Traditional ERC approaches predominantly rely on unimodal cues, such as text, audio, or visual data, leading to limitations in their effectiveness. These methods encounter two significant challenges: 1) Consistency in multimodal information. Before integrating various modalities, it is crucial to ensure that the data from different sources is aligned and coherent. 2) Contextual information capture. Successfully fusing multimodal features requires a keen understanding of the evolving emotional tone, especially in lengthy dialogues where emotions may shift and develop over time. To address these limitations, we propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the ERC task. MaTAV has the advantages of aligning unimodal features to ensure consistency across different modalities and handling long input sequences to better capture contextual multimodal information. Extensive experiments on the MELD and IEMOCAP datasets demonstrate that MaTAV significantly outperforms existing state-of-the-art methods on the ERC task by a large margin.
LGNov 30, 2022
VI-PINNs: Variance-involved Physics-informed Neural Networks for Fast and Accurate Prediction of Partial Differential EquationsBin Shan, Ye Li, Shengjun Huang
Although physics-informed neural networks (PINNs) have progressed considerably in many real applications recently, there remain problems to be further studied, such as achieving more accurate results, taking less training time, and quantifying the uncertainty of the predicted results. Recent advances have significantly improved the performance of PINNs in many aspects, but few have considered the effect of variance in the training process. In this work, we take into consideration the effect of variance and propose VI-PINNs to give better predictions. We output two values in the final layer of the network to represent the predicted mean and variance respectively, and the latter is used to represent the uncertainty of the output. A modified negative log-likelihood loss and an auxiliary task are introduced for fast and accurate training. We perform several experiments on a wide range of different problems to highlight the advantages of our approach. The results show that our method not only gives more accurate predictions but also converges faster.
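The two-output design described above (a predicted mean and a predicted variance) is commonly trained with a Gaussian negative log-likelihood. The sketch below shows the standard, unmodified form only; the abstract's "modified" loss and auxiliary task are not specified there, so nothing here should be read as the paper's exact objective. Predicting the log-variance instead of the variance is an assumption made to keep the variance positive.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Mean Gaussian negative log-likelihood, dropping the constant 0.5 * log(2 * pi).

    y, mean, log_var are arrays of the same shape; log_var is the predicted
    log-variance, so exp(log_var) is always a valid variance.
    """
    return float(np.mean(0.5 * (log_var + (y - mean) ** 2 / np.exp(log_var))))

# Toy check: for a fixed residual of 2, the loss is lowest when the predicted
# variance matches the squared residual (4), not when it is over-confident (1).
y = np.array([2.0])
mean = np.array([0.0])
loss_matched = gaussian_nll(y, mean, np.log(np.array([4.0])))
loss_overconfident = gaussian_nll(y, mean, np.array([0.0]))
```

The loss penalizes both large residuals and over-confident variances, which is what lets the variance output act as an uncertainty estimate.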
LGSep 1, 2024
Knowledge-data fusion oriented traffic state estimation: A stochastic physics-informed deep learning approachTing Wang, Ye Li, Rongjun Cheng et al.
Physics-informed deep learning (PIDL)-based models have recently garnered remarkable success in traffic state estimation (TSE). However, the prior knowledge used to guide regularization training in current mainstream architectures is based on deterministic physical models. The drawback is that a solely deterministic model fails to capture the universally observed traffic flow dynamic scattering effect, thereby yielding unreliable outcomes for traffic control. This study, for the first time, proposes stochastic physics-informed deep learning (SPIDL) for traffic state estimation. The idea behind SPIDL is simple and is based on the fact that a stochastic fundamental diagram provides the entire range of possible speeds for any given density with associated probabilities. Specifically, we select the percentile-based fundamental diagram and the distribution-based fundamental diagram as stochastic physics knowledge, and design corresponding physics-uninformed neural networks for effective fusion, thereby realizing two specific SPIDL models, namely $\alpha$-SPIDL and $\mathcal{B}$-SPIDL. The main contribution of SPIDL lies in addressing the "overly centralized guidance" caused by the one-to-one speed-density relationship in deterministic models during neural network training, enabling the network to digest more reliable knowledge-based constraints. Experiments on a real-world dataset indicate that the proposed SPIDL models achieve accurate traffic state estimation in sparse data scenarios. More importantly, as expected, SPIDL models reproduce well the scattering effect of field observations, demonstrating the effectiveness of fusing stochastic physics model knowledge with deep learning frameworks.
ROFeb 12, 2024Code
Customizable Perturbation Synthesis for Robust SLAM BenchmarkingXiaohao Xu, Tianyi Zhang, Sibo Wang et al.
Robustness is a crucial factor for the successful deployment of robots in unstructured environments, particularly in the domain of Simultaneous Localization and Mapping (SLAM). Simulation-based benchmarks have emerged as a highly scalable approach for robustness evaluation compared to real-world data collection. However, crafting a challenging and controllable noisy world with diverse perturbations remains relatively under-explored. To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations. This pipeline incorporates customizable hardware setups, software components, and perturbed environments. In particular, we introduce a comprehensive perturbation taxonomy along with a perturbation composition toolbox, allowing the transformation of clean simulations into challenging noisy environments. Utilizing the pipeline, we instantiate the Robust-SLAM benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced multi-modal SLAM models. Our extensive analysis uncovers the susceptibilities of existing SLAM models to real-world disturbance, despite their demonstrated accuracy in standard benchmarks. Our perturbation synthesis toolbox, SLAM robustness evaluation pipeline, and Robust-SLAM benchmark will be made publicly available at https://github.com/Xiaohao-Xu/SLAM-under-Perturbation/.
CLFeb 26, 2024Code
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual PropertyShiwen Ni, Minghuan Tan, Yuelin Bai et al.
Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while it scores lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark has much room for improvement, and even the most powerful ChatGPT does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.
CVApr 18
Bias-constrained multimodal intelligence for equitable and reliable clinical AICheng Li, Weijian Huang, Jiarun Liu et al.
The integration of medical imaging and clinical text has enabled the emergence of generalist artificial intelligence (AI) systems for healthcare. However, pervasive biases, such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities, pose significant challenges to the fairness and reliability of vision-language systems in real-world clinical settings. Here we present BiasCareVL, a bias-aware multimodal learning framework that introduces bias control directly into model design, rather than treating it as a post hoc correction. BiasCareVL incorporates adaptive uncertainty modeling with optional human-in-the-loop refinement to regulate the influence of dominant data patterns and to promote equitable reasoning under distributional imbalance. Trained on 3.44 million samples spanning over 15 imaging modalities, the framework supports diverse clinical tasks, including visual question answering, disease classification, segmentation, and report generation within a unified representation space. Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods, with pronounced gains in clinically challenging scenarios, including over 10% accuracy improvement in multi-class skin lesion diagnosis and more than 20% Dice improvement in small tumor segmentation. Furthermore, BiasCareVL achieves diagnostic performance exceeding human accuracy with substantially reduced time requirements when evaluated with board-certified radiologists. By open-sourcing BiasCareVL, we aim to promote a transparent, reproducible, and equitable future for AI in healthcare, paving the way for general-purpose, trustworthy, and clinically reliable AI systems.
LGDec 8, 2022
Physics-guided Data Augmentation for Learning the Solution Operator of Linear Differential EquationsYe Li, Yiwen Pang, Bin Shan
Neural networks, especially the recently proposed neural operator models, are increasingly being used to find the solution operator of differential equations. Compared to traditional numerical solvers, they are much faster and more efficient in practical applications. However, one critical issue is that training neural operator models requires a large amount of ground truth data, which usually comes from slow numerical solvers. In this paper, we propose a physics-guided data augmentation (PGDA) method to improve the accuracy and generalization of neural operator models. Training data is augmented naturally through the physical properties of differential equations such as linearity and translation. We demonstrate the advantage of PGDA on a variety of linear differential equations, showing that PGDA can improve the sample complexity and is robust to distributional shift.
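The linearity property mentioned above can be sketched as follows: if (f1, u1) and (f2, u2) are source/solution pairs of a linear equation, then any linear combination of them is also a valid pair, so new training data comes without extra solver calls. This is a minimal illustration under stated assumptions, not the authors' code: the helper `augment_by_linearity`, the finite-difference Poisson problem, and the sample sizes are all hypothetical, and translation-based augmentation is not shown.

```python
import numpy as np

def augment_by_linearity(f_batch, u_batch, n_new, rng):
    """Create new (f, u) pairs as random linear combinations of existing pairs."""
    coeffs = rng.normal(size=(n_new, f_batch.shape[0]))
    return coeffs @ f_batch, coeffs @ u_batch

# Toy linear problem: -u'' = f on (0, 1) with u(0) = u(1) = 0,
# discretized by second-order finite differences on n interior points.
n = 32
h = 1.0 / (n + 1)
L = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

rng = np.random.default_rng(0)
f_train = rng.normal(size=(4, n))            # four "expensive" solver inputs
u_train = np.linalg.solve(L, f_train.T).T    # their numerical solutions

# Ten extra pairs for free; each still satisfies L u = f up to round-off.
f_aug, u_aug = augment_by_linearity(f_train, u_train, n_new=10, rng=rng)
residual = np.max(np.abs(f_aug - u_aug @ L.T))
```

The residual check confirms that the augmented pairs remain consistent with the discretized equation, which is the property that makes this augmentation safe for linear problems.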
CVDec 24, 2025
DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy PredictionXiao Yu, Zhaojie Fang, Guanyu Zhou et al.
Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.
CPDec 7, 2022
Bi-LSTM Price Prediction based on Attention MechanismJiashu Lou, Leyi Cui, Ye Li
With the increasing enrichment and development of the financial derivatives market, the frequency of transactions keeps rising. Due to human limitations, algorithmic and automated trading have recently become a focus of discussion. In this paper, we propose a bidirectional LSTM neural network based on an attention mechanism and apply it to two popular assets, gold and bitcoin. In terms of feature engineering, we add traditional technical factors and also combine time series models to develop factors. In the selection of model parameters, we finally chose a two-layer deep learning network. According to the AUC measurement, the accuracy for bitcoin and gold is 71.94% and 73.03%, respectively. Using the forecast results, we achieved a return of 1089.34% in two years. We also compare the attention Bi-LSTM model proposed in this paper with the traditional model, and the results show that our model has the best performance on this data set. Finally, we discuss the significance of the model and the experimental results, as well as possible directions for future improvement.
CLMar 24
Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage GuaranteesYe Li, Anqi Hu, Yuanchang Ye et al.
Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model's capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
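The feasibility point in the abstract above can be made concrete with a small sketch: with a finite number of samples per question, no prediction set built from those samples can cover a question for which every sampled response is wrong. The function below computes that empirical floor on a calibration set. It is only an illustration of the idea; the paper's formal definition of the minimum achievable risk level and its calibration procedure may differ, and the names `empirical_min_risk` and `is_feasible` and the toy correctness matrix are assumptions.

```python
import numpy as np

def empirical_min_risk(correct):
    """Fraction of questions with no acceptable response among the samples.

    correct: boolean array of shape (n_questions, n_samples), True where a
    sampled response is judged acceptable.
    """
    return float(np.mean(~correct.any(axis=1)))

def is_feasible(target_risk, min_risk):
    """A target miscoverage level below the floor cannot be met by any subset of the samples."""
    return target_risk >= min_risk

# Toy calibration data: 4 questions, 2 sampled responses each (made-up labels)
correct = np.array([
    [True, False],
    [False, False],   # no sampled response is acceptable
    [False, True],
    [True, True],
])
floor = empirical_min_risk(correct)
```

In this toy case one of four questions has no acceptable sample, so a target risk of 0.1 is infeasible while 0.3 is feasible.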
CVDec 6, 2023Code
ShareCMP: Polarization-Aware RGB-P Semantic SegmentationZhuoyan Liu, Bo Wang, Lizhi Wang et al.
Multimodal semantic segmentation is developing rapidly, but the modality of RGB-Polarization remains underexplored. To delve into this problem, we construct a UPLight RGB-P segmentation benchmark with 12 typical underwater semantic classes. In this work, we design the ShareCMP, an RGB-P semantic segmentation framework with a shared dual-branch architecture (ShareCMP Encoder), which reduces the parameters and memory space by about 33.8% compared to previous dual-branch models. It encompasses a Polarization Generate Attention (PGA) module designed to generate polarization modal images with richer polarization properties for the encoder. In addition, we introduce the Class Polarization-Aware Loss (CPALoss) with Class Polarization-Aware Auxiliary Head (CPAAHead) to improve the learning and understanding of the encoder for polarization modal information and to optimize the PGA module. With extensive experiments on a total of three RGB-P benchmarks, our ShareCMP achieves the best performance in mIoU with fewer parameters on the UPLight (92.45% (+0.32)), ZJU (92.7% (+0.1)), and MCubeS (50.99% (+1.51)) datasets. And our ShareCMP (w/o PGA) achieves competitive or even higher performance on other RGB-X datasets compared to the corresponding state-of-the-art RGB-X methods. The code and datasets are available at https://github.com/LEFTeyex/ShareCMP.
ROMar 24
Variable-Resolution Virtual Maps for Autonomous Exploration with Unmanned Surface Vehicles (USVs)Ye Li, Yewei Huang, Wenlong GaoZhang et al.
Autonomous exploration by unmanned surface vehicles (USVs) in near-shore waters requires reliable localisation and consistent mapping over extended areas, but this is challenged by GNSS degradation, environment-induced localisation uncertainty, and limited on-board computation. Virtual map-based methods explicitly model localisation and mapping uncertainty by tightly coupling factor-graph SLAM with a map uncertainty criterion. However, their storage and computational costs scale poorly with fixed-resolution workspace discretisations, leading to inefficiency in large near-shore environments. Moreover, overvaluing feature-sparse open-water regions can increase the risk of SLAM failure as a result of imbalance between exploration and exploitation. To address these limitations, we propose a Variable-Resolution Virtual Map (VRVM), a computationally efficient method for representing map uncertainty using bivariate Gaussian virtual landmarks placed in the cells of an adaptive quadtree. The adaptive quadtree enables an area-weighted uncertainty representation that keeps coarse, far-field virtual landmarks deliberately uncertain while allocating higher resolution to information-dense regions, and reduces the sensitivity of the map valuation to local refinements of the tree. An expectation-maximisation (EM) planner is adopted to evaluate pose and map uncertainty along frontiers using the VRVM, balancing exploration and exploitation. We evaluate VRVM against several state-of-the-art exploration algorithms in the VRX Gazebo simulator, using a realistic marina environment across different testing scenarios with an increasing level of exploration difficulty. The results indicate that our method offers safer behaviour and better utilisation of on-board computation in GNSS-degraded near-shore environments.
LGApr 6Code
An End-to-End Framework for Building Large Language Models for Software OperationsJingkai He, Pengfei Chen, Chenghui Wu et al.
In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate high-quality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to obtain a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on tasks of diverse difficulty demonstrate that OpsLLM effectively learns and aligns with the infused operational domain knowledge, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.
AIJan 10, 2024Code
Graph-of-Thought: Utilizing Large Language Models to Solve Complex and Dynamic Business ProblemsYe Li
This paper presents Graph-of-Thought (GoT), a new model for workflow automation that enhances the flexibility and efficiency of Large Language Models (LLMs) in complex task execution. GoT advances beyond traditional linear and tree-like cognitive models with a graph structure that enables dynamic path selection. The open-source engine GoTFlow demonstrates the practical application of GoT, facilitating automated, data-driven decision-making across various domains. Despite challenges in complexity and transparency, GoTFlow's potential for improving business processes is significant, promising advancements in both efficiency and decision quality with continuous development.
CVFeb 25, 2025Code
LightFC-X: Lightweight Convolutional Tracker for RGB-X TrackingYunfeng Li, Bo Wang, Ye Li
Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of template-search area integrated features with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that our LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real-time on the CPU at a speed of 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
LGAug 1, 2024
Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural NetworksXianliang Xu, Ting Du, Wang Kong et al.
In the context of over-parameterization, there is a line of work demonstrating that randomly initialized (stochastic) gradient descent (GD) converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for training two-layer $\text{ReLU}^3$ Physics-Informed Neural Networks (PINNs), the learning rate can be improved from $\mathcal{O}(\lambda_0)$ to $\mathcal{O}(1/\|\bm{H}^{\infty}\|_2)$, implying that GD actually enjoys a faster convergence rate. Despite such improvements, the convergence rate is still tied to the least eigenvalue of the Gram matrix, leading to slow convergence. We then develop the positive definiteness of Gram matrices with general smooth activation functions and provide the convergence analysis of natural gradient descent (NGD) in training two-layer PINNs, demonstrating that the learning rate can be $\mathcal{O}(1)$ and at this rate, the convergence rate is independent of the Gram matrix. In particular, for smooth activation functions, the convergence rate of NGD is quadratic. Numerical experiments are conducted to verify our theoretical results.
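The claim that NGD converges at a rate independent of the Gram matrix can be seen in a simplified setting. The sketch below replaces the two-layer PINN with an over-parameterized linear model, which is an assumption made purely for illustration: the Jacobian `J` is then constant, the Gram matrix is `H = J J^T`, and the update `theta -= lr * J^T H^{-1} r` shrinks the residual `r` by exactly `(1 - lr)` per step, however ill-conditioned `H` is. The sizes, row scales, and learning rate are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 4, 50                       # over-parameterized: 50 > 4
row_scales = np.array([0.1, 1.0, 10.0, 100.0])    # makes the Gram matrix ill-conditioned
J = rng.normal(size=(n_samples, n_params)) * row_scales[:, None]
y = rng.normal(size=n_samples)

theta = np.zeros(n_params)
lr = 0.5
residual_norms = [np.linalg.norm(J @ theta - y)]
for _ in range(10):
    r = J @ theta - y                  # residual of the linear model
    H = J @ J.T                        # Gram matrix
    theta = theta - lr * J.T @ np.linalg.solve(H, r)
    residual_norms.append(np.linalg.norm(J @ theta - y))

# Per-step contraction factors of the residual norm; each is about 1 - lr.
ratios = np.array(residual_norms[1:]) / np.array(residual_norms[:-1])
```

Plain gradient descent on the same problem would need a learning rate limited by the largest eigenvalue of `H` and would converge at a rate set by the smallest one, which is the dependence the abstract says NGD removes.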
NIMar 11
A Secure Splitting and Acceleration Strategy for TCP/QUIC in Interplanetary NetworksJianhao Yu, Ye Li, Qingfang Jiang et al.
Interplanetary networks (IPNs) present unique challenges such as extreme delay, high loss, and frequent disruptions that severely degrade the performance of conventional transport protocols like Transmission Control Protocol (TCP) and Quick UDP Internet Connection (QUIC). To address these issues, we propose a secure transport acceleration strategy tailored for IPNs. This strategy is founded on our Non-Transparent Secure Proxy (NTSP) architecture, which enables connection splitting for end-to-end encrypted flows while preserving application layer security. Based on the NTSP, we design an IPN-aware transport policy that combines (i) a rate-based congestion control algorithm exploiting the pre-scheduled nature of deep-space links to achieve stable and efficient bandwidth utilization, and (ii) an adaptive packet-level forward error correction scheme to provide low-latency loss recovery without retransmissions. Furthermore, we introduce a theoretically grounded backpressure flow control mechanism, deriving an analytical model for optimal buffer sizing to mitigate rate mismatch and prevent bufferbloat. The strategy is implemented in a prototype system, PEPspace, and evaluated in representative Earth-Moon scenarios. Results show near-capacity and stable goodput and substantially improved delivery performance compared with TCP/QUIC variants and existing Performance Enhancing Proxies, while maintaining low latency and robust data delivery across intermittent links. The NTSP architecture is further discussed as a foundational framework for future unified IP/DTN architectures, bridging a key architectural gap in heterogeneous space networks.
LGSep 7, 2024
Component Fourier Neural Operator for Singularly Perturbed Differential EquationsYe Li, Ting Du, Yiwen Pang et al.
Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this manuscript, we introduce Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon Fourier Neural Operator (FNO), while simultaneously incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as Deep Operator Network (DeepONet), potentially leading to similar SPDE solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations.
AISep 25, 2025Code
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful ReasoningTianrun Xu, Haoda Jing, Ye Li et al.
Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of "thinking with images," which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
CVApr 22, 2025Code
SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object TrackingYunfeng Li, Bo Wang, Jiahao Wan et al.
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
CVJun 24, 2024Code
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness BenchmarkingXiaohao Xu, Tianyi Zhang, Sibo Wang et al.
Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Mapping (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world with diverse perturbations remains under-explored. To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations. The pipeline comprises a comprehensive taxonomy of sensor and motion perturbations for embodied multi-modal (specifically RGB-D) sensing, categorized by their sources and propagation order, allowing for procedural composition. We also provide a toolbox for synthesizing these perturbations, enabling the transformation of clean environments into challenging noisy simulations. Utilizing the pipeline, we instantiate the large-scale Noisy-Replica benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced RGB-D SLAM models. Our extensive analysis uncovers the susceptibilities of both neural (NeRF and Gaussian Splatting-based) and non-neural SLAM models to disturbances, despite their demonstrated accuracy in standard benchmarks. Our code is publicly available at https://github.com/Xiaohao-Xu/SLAM-under-Perturbation.
CVJun 11, 2024Code
RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer TrackerYunfeng Li, Bo Wang, Jiuran Sun et al.
Vision cameras and sonar are naturally complementary in the underwater environment. Combining the information from the two modalities will promote better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities and SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
CVMay 6, 2024Code
Transformer-based RGB-T Tracking with Channel and Spatial Feature FusionYunfeng Li, Bo Wang, Ye Li
The main problem in RGB-T tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack a direct interaction between the template and the search area, which limits the model's ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channels and spatial features in RGB-T tracking and propose CSTNet. It uses the Vision Transformer (ViT) as the backbone and adds a Joint Spatial and Channel Fusion Module (JSCFM) and Spatial Fusion Module (SFM) integrated between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multi-level spatial features. The SFM module includes a cross-attention-like architecture for cross modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. To enhance practicality, we retrain the model without JSCFM and SFM modules and use CSNet as the pretraining weight, and propose CSTNet-small, which achieves 50% speedup with an average decrease of 1-2% in SR and PR performance. CSTNet and CSTNet-small achieve real-time speeds of 21 fps and 33 fps on the Nvidia Jetson Xavier, meeting actual deployment requirements. Code is available at https://github.com/LiYunfengLYF/CSTNet.
LGAug 15, 2024
Inversion-DeepONet: A Novel DeepONet-Based Network with Encoder-Decoder for Full Waveform InversionZekai Guo, Lihui Chai, Shengjun Huang et al.
Full waveform inversion (FWI) plays a crucial role in the field of geophysics. There has been much research on applying deep learning (DL) methods to FWI. The success of DL-FWI relies significantly on the quantity and diversity of the datasets. Nevertheless, existing FWI datasets, like OpenFWI, where sources have fixed locations or identical frequencies, provide limited information and do not represent complex real-world scenes. For instance, low frequencies help in resolving larger-scale structures, while high frequencies allow for more detailed subsurface features. We consider that simultaneously using sources with different frequencies, instead of performing inversion using low-frequency data and then gradually introducing higher-frequency data, has rationale and potential advantages. Hence, we develop three enhanced datasets based on OpenFWI where each source has varying locations, frequencies or both. Moreover, we propose a novel deep operator network (DeepONet) architecture, Inversion-DeepONet, for FWI. We utilize a convolutional neural network (CNN) to extract the features from seismic data in the branch net. Source parameters, such as locations and frequencies, are fed to the trunk net. Then another CNN is employed as the decoder of DeepONet to reconstruct the velocity models more effectively. Through experiments, we confirm the superior accuracy and generalization ability of our network compared with existing data-driven FWI methods.
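The branch/trunk/decoder layout described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random linear maps stand in for the trained CNNs, all sizes and names (`p`, `n_seis`, `grid`, `W_dec`, and so on) are hypothetical, and the element-wise merge of branch and trunk features is an assumption. Only the inner-product readout in `deeponet_scalar` is the standard DeepONet form.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 16          # latent width shared by branch and trunk (hypothetical)
n_seis = 64     # length of a flattened seismic record (hypothetical)
n_src = 2       # source parameters: location and frequency
grid = (8, 8)   # velocity-model grid size (hypothetical)

# Random weights stand in for trained networks.
W_branch = rng.normal(size=(p, n_seis))
W_trunk1 = rng.normal(size=(p, n_src))
W_trunk2 = rng.normal(size=(p, p))
W_dec = rng.normal(size=(grid[0] * grid[1], p))


def branch(seismic):
    # Stand-in for the CNN feature extractor of the branch net.
    return np.tanh(W_branch @ seismic)


def trunk(source_params):
    # Small two-layer MLP over the source location and frequency.
    return np.tanh(W_trunk2 @ np.tanh(W_trunk1 @ source_params))


def deeponet_scalar(seismic, source_params):
    # Standard DeepONet readout: inner product of branch and trunk features.
    return float(branch(seismic) @ trunk(source_params))


def deeponet_with_decoder(seismic, source_params):
    # Assumed merge: element-wise product of the two feature vectors,
    # followed by a linear stand-in for the CNN decoder.
    merged = branch(seismic) * trunk(source_params)
    return (W_dec @ merged).reshape(grid)


seismic = rng.normal(size=n_seis)
source = np.array([0.3, 15.0])  # hypothetical (location, frequency) pair
velocity = deeponet_with_decoder(seismic, source)
```

Replacing the scalar inner product with a decoder lets the network emit a full velocity map per input instead of one value per query point, which is the role the abstract assigns to the second CNN.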
LGSep 7, 2024
A Multi-scenario Attention-based Generative Model for Personalized Blood Pressure Time Series ForecastingCheng Wan, Chenjie Xie, Longfei Liu et al.
Continuous blood pressure (BP) monitoring is essential for timely diagnosis and intervention in critical care settings. However, BP varies significantly across individuals; this inter-patient variability motivates the development of personalized models tailored to each patient's physiology. In this work, we propose a personalized BP forecasting model mainly using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. This time-series model incorporates 2D representation learning to capture complex physiological relationships. Experiments are conducted on datasets collected from three diverse scenarios with BP measurements from a total of 60 subjects. Results demonstrate that the model achieves accurate and robust BP forecasts across scenarios within the Association for the Advancement of Medical Instrumentation (AAMI) standard criteria. This reliable early detection of abnormal fluctuations in BP is crucial for at-risk patients undergoing surgery or intensive care. The proposed model provides a valuable addition for continuous BP tracking to reduce mortality and improve prognosis.
CVApr 24
Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category DiscoveryJingyuan Xia, Ruikang Hu, Ye Li et al.
Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.
LGJul 3, 2024
Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural NetworksXianliang Xu, Ting Du, Wang Kong et al.
Optimization algorithms are crucial in training physics-informed neural networks (PINNs), as unsuitable methods may lead to poor solutions. Implicit gradient descent (IGD) outperforms the common gradient descent (GD) algorithm in handling certain multi-scale problems. In this paper, we provide a convergence analysis of IGD in training over-parameterized two-layer PINNs. We first derive the training dynamics of IGD in training two-layer PINNs. Then, over-parameterization allows us to prove that the randomly initialized IGD converges to a globally optimal solution at a linear convergence rate. Moreover, due to the distinct training dynamics of IGD compared to GD, the learning rate can be selected independently of the sample size and the least eigenvalue of the Gram matrix. Additionally, the novel approach used in our convergence analysis imposes a milder requirement on the network width. Finally, empirical results validate our theoretical findings.
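The difference between the two update rules can be seen on a toy problem. The sketch below is not the paper's PINN setting: it assumes a two-dimensional quadratic loss $L(\theta) = \frac{1}{2}\theta^\top A\theta - b^\top\theta$ with widely separated curvatures, chosen only because the implicit step then reduces to a linear solve. The matrix `A`, vector `b`, and step size `eta` are illustrative choices. For a general loss the implicit step would need an inner nonlinear solver.

```python
import numpy as np


def gd_step(theta, A, b, eta):
    # Explicit update: theta_{k+1} = theta_k - eta * grad L(theta_k).
    return theta - eta * (A @ theta - b)


def igd_step(theta, A, b, eta):
    # Implicit update: theta_{k+1} = theta_k - eta * grad L(theta_{k+1}).
    # For a quadratic loss this is (I + eta * A) theta_{k+1} = theta_k + eta * b.
    n = len(theta)
    return np.linalg.solve(np.eye(n) + eta * A, theta + eta * b)


# Two curvature scales three orders of magnitude apart (toy multi-scale problem).
A = np.diag([1.0, 1000.0])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)  # exact minimizer

eta = 0.5  # far above the explicit stability limit 2 / 1000
theta_gd = np.zeros(2)
theta_igd = np.zeros(2)
for _ in range(50):
    theta_gd = gd_step(theta_gd, A, b, eta)
    theta_igd = igd_step(theta_igd, A, b, eta)

gd_error = np.linalg.norm(theta_gd - theta_star)
igd_error = np.linalg.norm(theta_igd - theta_star)
```

With this step size the explicit iteration amplifies the stiff component at every step, while the implicit iteration contracts both components for any positive `eta`. Each implicit step costs a linear solve instead of one gradient evaluation.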