NE Jun 23, 2022
A Survey on Learnable Evolutionary Algorithms for Scalable Multiobjective Optimization
Songbai Liu, Qiuzhen Lin, Jianqiang Li et al.
Recent decades have witnessed great advances in multiobjective evolutionary algorithms (MOEAs) for multiobjective optimization problems (MOPs). However, these progressively improved MOEAs have not necessarily been equipped with scalable and learnable problem-solving strategies for the new and grand challenges posed by scaling-up MOPs, whose complexity keeps increasing in diverse respects, mainly expensive function evaluations, many objectives, large-scale search spaces, time-varying environments, and multitasking. Under different scenarios, divergent thinking is required to design powerful new MOEAs that solve them effectively. In this context, research on learnable MOEAs that incorporate machine learning techniques has received extensive attention in the field of evolutionary computation. This paper begins with a general taxonomy of scaling-up MOPs and learnable MOEAs, followed by an analysis of the challenges that these MOPs pose to traditional MOEAs. We then provide a synthesized overview of recent advances in learnable MOEAs for solving various scaling-up MOPs, focusing primarily on four attractive directions: learnable evolutionary discriminators for environmental selection, learnable evolutionary generators for reproduction, learnable evolutionary evaluators for function evaluations, and learnable evolutionary transfer modules for sharing or reusing optimization experience. These insights into learnable MOEAs are offered to readers as a reference to the general direction of efforts in this field.
LG Aug 29, 2024 Code
An Exploratory Deep Learning Approach for Predicting Subsequent Suicidal Acts in Chinese Psychological Support Hotlines
Changwei Song, Qing Zhao, Jianqiang Li et al.
Psychological support hotlines are an effective suicide prevention measure that typically relies on professionals using suicide risk assessment scales to predict individual risk scores. However, the accuracy of scale-based predictive methods for suicide risk assessment can vary widely depending on the expertise of the operator. This limitation underscores the need for more reliable methods, prompting this research's exploration of artificial intelligence to improve the accuracy and efficiency of suicide risk prediction within the context of psychological support hotlines. The study included data from 1,549 subjects in China who contacted a psychological support hotline between 2015 and 2017. Each participant was followed for 12 months to identify instances of suicidal behavior. We proposed a novel multi-task learning method that uses the large-scale pre-trained model Whisper for feature extraction and fits psychological scales while predicting the risk of suicide. The proposed method yields a 2.4 percentage-point improvement in F1-score compared to the traditional manual approach based on the psychological scales. Our model demonstrated superior performance compared to eight other popular models. To our knowledge, this study is the first to apply deep learning to long-term speech data to predict suicide risk in China, indicating great potential for clinical applications. The source code is publicly available at: https://github.com/songchangwei/Suicide-Risk-Prediction.
CV Nov 18, 2023 Code
Morphology-Enhanced CAM-Guided SAM for weakly supervised Breast Lesion Segmentation
Xin Yue, Xiaoling Liu, Qing Zhao et al.
Ultrasound imaging plays a critical role in the early detection of breast cancer. Accurate identification and segmentation of lesions are essential steps in clinical practice, requiring methods to assist physicians in lesion segmentation. However, ultrasound lesion segmentation models based on supervised learning require extensive manual labeling, which is both time-consuming and labor-intensive. In this study, we present a novel framework for weakly supervised lesion segmentation in early breast ultrasound images. Our method uses morphological enhancement and class activation map (CAM)-guided localization. Finally, we employ the Segment Anything Model (SAM), a computer vision foundation model, for detailed segmentation. This approach does not require pixel-level annotation, thereby reducing the cost of data annotation. The performance of our method is comparable to supervised learning methods that require manual annotations, achieving a Dice score of 74.39% and outperforming comparative supervised models in terms of Hausdorff distance in the BUSI dataset. These results demonstrate that our framework effectively integrates weakly supervised learning with SAM, providing a promising solution for breast cancer image analysis. The code for this study is available at: https://github.com/YueXin18/MorSeg-CAM-SAM.
AI Aug 29, 2023
Enhancing Psychological Counseling with Large Language Model: A Multifaceted Decision-Support System for Non-Professionals
Guanghui Fu, Qing Zhao, Jianqiang Li et al.
In the contemporary landscape of social media, an alarming number of users express negative emotions, some of which manifest as strong suicidal intentions. This situation underscores a profound need for trained psychological counselors who can carry out effective mental health interventions. However, training these professionals is an imperative but time-consuming task. Consequently, mobilizing non-professionals or volunteers in this capacity emerges as a pressing concern. Leveraging the capabilities of artificial intelligence, and in particular the recent advances in large language models, offers a viable solution to this challenge. This paper introduces a novel model built on large language models to fully assist non-professionals in providing psychological interventions in response to online user discourse. This framework makes it feasible to harness the power of non-professional counselors in a meaningful way. A comprehensive study was conducted involving ten professional psychological counselors of varying expertise, evaluating the system across five critical dimensions. The findings affirm that our system is capable of analyzing patients' issues with relative accuracy and offering professional-level strategy recommendations, thereby enhancing support for non-professionals. This research serves as a compelling validation of the application of large language models in the field of psychology and lays the groundwork for a new paradigm of community-based mental health support.
RO Apr 23 Code
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
Dachong Li, ZhuangZhuang Chen, Jin Zhang et al.
Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4% to 12.4% over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of 83.21%. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.
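To make the idea of a tolerance region concrete, the sketch below penalizes only the part of a displacement that falls outside a band around each anchor, so small deviations cost nothing. It is a hypothetical illustration and not the CorridorVLA objective: the name `corridor_penalty`, the constant half-width `tol`, and the squared-hinge form are all assumptions.

```python
def corridor_penalty(deltas, anchors, tol):
    """Squared-hinge penalty for displacements that leave a tolerance band.

    deltas  -- displacements implied by the generated actions (floats)
    anchors -- predicted anchor displacements at the same steps (floats)
    tol     -- half-width of the corridor (assumed constant in this sketch)
    """
    total = 0.0
    for d, a in zip(deltas, anchors):
        excess = max(0.0, abs(d - a) - tol)  # zero while inside the corridor
        total += excess ** 2
    return total / len(deltas)


# Hypothetical values: the first trajectory stays inside the band, the second does not.
inside = corridor_penalty([0.10, 0.21], [0.12, 0.20], tol=0.05)
outside = corridor_penalty([0.30, 0.20], [0.12, 0.20], tol=0.05)
```

Because the penalty is exactly zero inside the band, its gradient is zero there as well, which is one simple way to get the "corrective gradients only outside the corridor" behavior the abstract describes.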
CV Jun 9, 2023
RePaint-NeRF: NeRF Editting via Semantic Masks and Diffusion Models
Xingchen Zhou, Ying He, F. Richard Yu et al.
The emergence of Neural Radiance Fields (NeRF) has promoted the development of synthesized high-fidelity views of the intricate real world. However, it is still a very demanding task to repaint the content in NeRF. In this paper, we propose a novel framework that can take RGB images as input and alter the 3D content in neural scenes. Our work leverages existing diffusion models to guide changes in the designated 3D content. Specifically, we semantically select the target object, and a pre-trained diffusion model guides the NeRF model to generate new 3D objects, which can improve the editability, diversity, and application range of NeRF. Experimental results show that our algorithm is effective for editing 3D objects in NeRF under different text prompts, including editing appearance, shape, and more. We validate our method on both real-world and synthetic datasets for these editing tasks. Please visit https://starstesla.github.io/repaintnerf for a better view of our results.
CL Sep 7, 2023
Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media
Hongzhi Qi, Qing Zhao, Jianqiang Li et al.
On social media, users often express their personal feelings, which may exhibit cognitive distortions or even suicidal tendencies on certain specific topics. Early recognition of these signs is critical for effective psychological intervention. In this paper, we introduce two novel datasets from Chinese social media: SOS-HL-1K for suicidal risk classification and SocialCD-3K for cognitive distortion detection. The SOS-HL-1K dataset contains 1,249 posts, and the SocialCD-3K dataset is a multi-label classification dataset containing 3,407 posts. We propose a comprehensive evaluation using two supervised learning methods and eight large language models (LLMs) on the proposed datasets. From the prompt engineering perspective, we experimented with two types of prompt strategies, including four zero-shot and five few-shot strategies. We also evaluated the performance of the LLMs after fine-tuning on the proposed tasks. The experimental results show that there is still a huge gap between LLMs relying only on prompt engineering and supervised learning. In the suicide classification task, this gap is 6.95 percentage points in F1-score, while in the cognitive distortion task, the gap is even more pronounced, reaching 31.53 percentage points in F1-score. However, after fine-tuning, this difference is significantly reduced. In the suicide and cognitive distortion classification tasks, the gap decreases to 4.31% and 3.14%, respectively. This research highlights the potential of LLMs in psychological contexts, but supervised learning remains necessary for more challenging tasks. All datasets and code are made available.
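The gaps above are differences between F1-scores, so they are naturally expressed in percentage points. As a minimal sketch, the snippet below computes F1 from confusion counts and the point gap between two classifiers; the counts are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for two classifiers on the same test set.
supervised = f1_score(tp=80, fp=20, fn=20)  # precision = recall = 0.8
prompted = f1_score(tp=70, fp=30, fn=30)    # precision = recall = 0.7

# A difference of two percentages is reported in percentage points.
gap_points = 100 * (supervised - prompted)
```

With these counts the two F1-scores are 80% and 70%, a gap of 10 percentage points, which corresponds to a relative drop of 12.5%; the distinction matters when comparing reported gaps across tasks.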
CV Nov 30, 2025 Code
Silhouette-based Gait Foundation Model
Dingqiang Ye, Chao Fan, Kartik Narayan et al.
Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modalities. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Code and models will be released at https://github.com/ShiqiYu/OpenGait.
LG Jan 29 Code
From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
Qianwei Yang, Dong Xu, Zhangfan Yang et al.
Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol.
AI Jul 28, 2024
A Generic Review of Integrating Artificial Intelligence in Cognitive Behavioral Therapy
Meng Jiang, Qing Zhao, Jianqiang Li et al.
Cognitive Behavioral Therapy (CBT) is a well-established intervention for mitigating psychological issues by modifying maladaptive cognitive and behavioral patterns. However, delivery of CBT is often constrained by resource limitations and barriers to access. Advancements in artificial intelligence (AI) have provided technical support for the digital transformation of CBT. In particular, the emergence of pre-training models (PTMs) and large language models (LLMs) holds immense potential to support, augment, optimize, and automate CBT delivery. This paper reviews the literature on integrating AI into CBT interventions. We begin with an overview of CBT. Then, we introduce the integration of AI into CBT across various stages: pre-treatment, therapeutic process, and post-treatment. Next, we summarize the datasets relevant to some CBT-related tasks. Finally, we discuss the benefits and current limitations of applying AI to CBT. We suggest key areas for future research, highlighting the need for further exploration and validation of the long-term efficacy and clinical utility of AI-enhanced CBT. The transformative potential of AI in reshaping the practice of CBT heralds a new era of more accessible, efficient, and personalized mental health interventions.
DS Feb 23
Exploiting Low-Rank Structure in Max-K-Cut Problems
Ria Stevens, Fangshuo Liao, Barbara Su et al.
We approach the Max-3-Cut problem through the lens of maximizing complex-valued quadratic forms and demonstrate that low-rank structure in the objective matrix can be exploited, leading to alternative algorithms to classical semidefinite programming (SDP) relaxations and heuristic techniques. We propose an algorithm for maximizing these quadratic forms over a domain of size $K$ that enumerates and evaluates a set of $O\left(n^{2r-1}\right)$ candidate solutions, where $n$ is the dimension of the matrix and $r$ represents the rank of an approximation of the objective. We prove that this candidate set is guaranteed to include the exact maximizer when $K=3$ (corresponding to Max-3-Cut) and the objective is low-rank, and provide approximation guarantees when the objective is a perturbation of a low-rank matrix. This construction results in a family of novel, inherently parallelizable and theoretically-motivated algorithms for Max-3-Cut. Extensive experimental results demonstrate that our approach achieves performance comparable to existing algorithms across a wide range of graphs, while being highly scalable.
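The link between Max-3-Cut and a complex quadratic form can be checked on a toy graph. Labeling each vertex with a cube root of unity gives $|z_i - z_j|^2 = 3$ when two labels differ and $0$ otherwise, so for the weighted graph Laplacian $L$ the identity $z^H L z = 3 \cdot \mathrm{cut}(z)$ holds. The sketch below verifies this by exhaustive search; it is a brute-force baseline over all $3^n$ labelings, not the low-rank candidate enumeration the abstract describes, and the Laplacian choice of objective matrix and the helper names are assumptions.

```python
import cmath
from itertools import product

# Cube roots of unity: one complex label per partition.
ROOTS = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]


def laplacian(n, edges):
    """Weighted graph Laplacian as a dense list of lists."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L


def quadratic_form(L, z):
    """Real part of z^H L z for a complex vector z."""
    n = len(z)
    return sum((z[i].conjugate() * L[i][j] * z[j]).real
               for i in range(n) for j in range(n))


def cut_weight(edges, labels):
    """Total weight of edges whose endpoints carry different labels."""
    return sum(w for i, j, w in edges if labels[i] != labels[j])


def max_3_cut_bruteforce(n, edges):
    """Maximize the quadratic form over all 3^n labelings (tiny graphs only)."""
    L = laplacian(n, edges)
    best = max(product(range(3), repeat=n),
               key=lambda lab: quadratic_form(L, [ROOTS[k] for k in lab]))
    return best, cut_weight(edges, best)


# Hypothetical example: a 4-cycle with unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
labels, weight = max_3_cut_bruteforce(4, edges)
form_value = quadratic_form(laplacian(4, edges), [ROOTS[k] for k in labels])
```

The 4-cycle is bipartite, so every edge can be cut and the maximizer attains weight 4 with a quadratic-form value of 12. The exponential search is only there to make the equivalence visible; the point of the paper's approach is to avoid it when the objective is (approximately) low-rank.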
CL Sep 10, 2024
Deep Learning and Large Language Models for Audio and Text Analysis in Predicting Suicidal Acts in Chinese Psychological Support Hotlines
Yining Chen, Jianqiang Li, Changwei Song et al.
Suicide is a pressing global issue, demanding urgent and effective preventive interventions. Among the various strategies in place, psychological support hotlines have proved to be a potent intervention method. Approximately two million people in China attempt suicide annually, with many individuals making multiple attempts. Prompt identification and intervention for high-risk individuals are crucial to preventing tragedies. With the rapid advancement of artificial intelligence (AI), especially the development of large-scale language models (LLMs), new technological tools have been introduced to the field of mental health. This study included 1,284 subjects and was designed to validate whether deep learning models and LLMs, using audio and transcribed text from support hotlines, can effectively predict suicide risk. We proposed a simple LLM-based pipeline that first summarizes transcribed text from approximately one hour of speech to extract key features and then predicts future suicidal behaviors. We compared our LLM-based method with the traditional manual scale approach in a clinical setting and with five advanced deep learning models. Surprisingly, the proposed simple LLM pipeline achieved strong performance on a test set of 46 subjects, with an F1 score of 76% when combined with manual scale rating. This is 7% higher than the best speech-based deep learning models and represents a 27.82 percentage-point improvement in F1 score compared to using the manual scale approach alone. Our study explores new applications of LLMs and demonstrates their potential for future use in suicide prevention efforts.
IV Feb 12, 2024 Code
Comparative Analysis of ImageNet Pre-Trained Deep Learning Models and DINOv2 in Medical Imaging Classification
Yuning Huang, Jingchen Zou, Lanxi Meng et al.
Medical image analysis frequently encounters data scarcity challenges. Transfer learning has been effective in addressing this issue while conserving computational resources. The recent advent of foundation models such as DINOv2, which uses the vision transformer architecture, has opened new opportunities in the field and garnered significant interest. However, DINOv2's performance on clinical data still needs to be verified. In this paper, we performed a glioma grading task using three clinical modalities of brain MRI data. We compared the performance of various pre-trained deep learning models, including those based on ImageNet and DINOv2, in a transfer learning context. Our focus was on understanding the impact of the freezing mechanism on performance. We also validated our findings on three other types of public datasets: chest radiography, fundus radiography, and dermoscopy. Our findings indicate that on our clinical dataset, DINOv2's performance was not as strong as that of ImageNet-based pre-trained models, whereas on public datasets, DINOv2 generally outperformed other models, especially when using the frozen mechanism. Similar performance was observed with various sizes of DINOv2 models across different tasks. In summary, DINOv2 is viable for medical image classification tasks, particularly with data resembling natural images. However, its effectiveness may vary with data that differs significantly from natural images, such as MRI. In addition, smaller versions of the model can be adequate for medical tasks, offering resource-saving benefits. Our code is available at https://github.com/GuanghuiFU/medical_DINOv2_eval.
LG Jan 30 Code
Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation
Dong Xu, Qihua Pan, Sisi Yuan et al.
Molecular generative models, often employing GPT-style language modeling on molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains unclear and subject to debate whether these models adhere to predictable scaling laws under fixed computational budgets, which is a crucial understanding for optimally allocating resources between model size, data volume, and molecular representation. In this study, we systematically investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. We train 300 models and conduct over 10,000 experiments, rigorously controlling compute budgets while independently varying model size, number of training tokens, and molecular representation. Our results demonstrate clear scaling laws in molecular models for both pretraining and downstream transfer, reveal the substantial impact of molecular representation on performance, and explain previously observed inconsistencies in scaling behavior for molecular generation. Additionally, we publicly release the largest library of molecular language models to date to facilitate future research and development. Code and models are available at https://github.com/SZU-ADDG/MLM-Scaling.
QM Jul 24, 2023
DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for Automatic Protein Function Prediction
Zihao Li, Changkun Jiang, Jianqiang Li
Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem aimed at automating protein enrichment analysis to eliminate the current reliance on labor-intensive wet-lab methods. Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions. For example, protein sequences, structural information, and protein-protein interaction networks are integrated as prior knowledge to fuse with GO term embeddings and generate the ultimate prediction results. However, these methods are limited by the difficulty in obtaining structural information or network topology information, as well as the accuracy of such data. Therefore, more and more methods that only use protein sequences for protein function prediction have been proposed, which is a more reliable and computationally cheaper approach. However, the existing methods fail to fully extract feature information from protein sequences or label data because they do not adequately consider the intrinsic characteristics of the data itself. Therefore, we propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically, and utilizes graph attention networks (GATs) and contrastive learning for protein function prediction. Specifically, we compute embeddings of the sequence and label data using pre-trained models to reduce computational costs and improve the embedding accuracy. Then, we use GATs to dynamically extract the structural information of non-Euclidean data, and learn general features of the label dataset with contrastive learning by constructing positive and negative example samples. Experimental results demonstrate that our proposed model exhibits better scalability in GO term enrichment analysis on large-scale datasets.
CV Apr 11
PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit-Explicit Optimization
Dongli Wu, Jingyu Hu, Ka-Hei Hui et al.
Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples on stylized and real-world images further showcase the robustness of the method. We will release code and models upon publication.
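In group-relative policy optimization, the critic is replaced by a baseline computed from a group of samples drawn for the same input: each sample's advantage is its reward standardized within the group. The sketch below shows that step only, with hypothetical evaluator scores; the use of the population standard deviation and the `eps` stabilizer are assumptions, and nothing here reflects how Scene-GRPO itself computes or weights rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator scores for four layouts sampled from one input image.
scores = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(scores)
```

Layouts scoring above the group mean get positive advantages and are made more likely by the policy update, while those below the mean are suppressed, which is how an evaluator can act as a preference signal without a learned value function.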
CL Feb 14, 2024 Code
Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis
Wei Zhai, Hongzhi Qi, Qing Zhao et al.
In the current environment, psychological issues are prevalent and widespread, with social media serving as a key outlet for individuals to share their feelings. This results in the generation of vast quantities of data daily, where negative emotions have the potential to precipitate crisis situations. There is a recognized need for models capable of efficient analysis. While pre-trained language models have demonstrated their effectiveness broadly, there is a noticeable gap in pre-trained models tailored for specialized domains like psychology. To address this, we have collected a large dataset from Chinese social media platforms and enriched it with publicly available datasets to create a comprehensive database encompassing 3.36 million text entries. To enhance the model's applicability to psychological text analysis, we integrated psychological lexicons into the pre-training masking mechanism. Building on an existing Chinese language model, we performed adaptive training to develop a model specialized for the psychological domain. We evaluated our model's performance across six public datasets, where it demonstrated improvements compared to eight other models. Additionally, in the qualitative comparison experiment, our model provided psychologically relevant predictions given the masked sentences. Due to concerns regarding data privacy, the dataset will not be made publicly available. However, we have made the pre-trained models and code publicly accessible to the community via: https://github.com/zwzzzQAQ/Chinese-MentalBERT.
CV Jan 23, 2024 Code
ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
Dachong Li, Li Li, Zhuangzhuang Chen et al.
Large kernels make standard convolutional neural networks (CNNs) great again over transformer architectures in various vision tasks. Nonetheless, recent studies meticulously designed around increasing kernel size have shown diminishing returns or stagnation in performance. Thus, the hidden factors of large kernel convolution that affect model performance remain unexplored. In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features by multiple pathways. To this end, we leverage the multi-path long-distance sparse dependency relationship to enhance feature utilization via the proposed Shiftwise (SW) convolution operator with a pure CNN architecture. In a wide range of vision tasks such as classification, segmentation, and detection, SW surpasses state-of-the-art transformers and CNN architectures, including SLaK and UniRepLKNet. More importantly, our experiments demonstrate that $3 \times 3$ convolutions can replace large convolutions in existing large kernel CNNs to achieve comparable effects, which may inspire follow-up works. Code and all models are available at https://github.com/lidc54/shift-wiseConv.
CL Jan 15, 2025 Code
Deep Learning-Based Feature Fusion for Emotion Analysis and Suicide Risk Differentiation in Chinese Psychological Support Hotlines
Han Wang, Jianqiang Li, Qing Zhao et al.
Mental health is a critical global public health issue, and psychological support hotlines play a pivotal role in providing mental health assistance and identifying suicide risks at an early stage. However, the emotional expressions conveyed during these calls remain underexplored in current research. This study introduces a method that combines pitch acoustic features with deep learning-based features to analyze and understand emotions expressed during hotline interactions. Using data from China's largest psychological support hotline, our method achieved an F1-score of 79.13% for negative binary emotion classification. Additionally, the proposed approach was validated on an open dataset for multi-class emotion classification, where it demonstrated better performance compared to the state-of-the-art methods. To explore its clinical relevance, we applied the model to analyze the frequency of negative emotions and the rate of emotional change in the conversation, comparing 46 subjects with suicidal behavior to those without. While the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. Importantly, our findings suggest that emotional fluctuation intensity and frequency could serve as novel features for psychological assessment scales and suicide risk prediction. The proposed method provides valuable insights into emotional dynamics and has the potential to advance early intervention and improve suicide prevention strategies through integration with clinical tools and assessments. The source code is publicly available at https://github.com/Sco-field/Speechemotionrecognition/tree/main.
CV Nov 1, 2024 Code
Topology and Intersection-Union Constrained Loss Function for Multi-Region Anatomical Segmentation in Ocular Images
Ruiyu Xia, Jianqiang Li, Xi Xu et al.
Ocular Myasthenia Gravis (OMG) is a rare disease that is challenging to detect in its early stages, but symptoms often first appear in the eye muscles, such as drooping eyelids and double vision. Ocular images can be used for early diagnosis by segmenting different regions, such as the sclera, iris, and pupil, which allows for the calculation of area ratios to support accurate medical assessments. However, no publicly available datasets or tools currently exist for this purpose. To address this, we propose a new topology and intersection-union constrained loss function (TIU loss) that improves performance when training datasets are small. We conducted experiments on a public dataset consisting of 55 subjects and 2,197 images. Our proposed method outperformed two widely used loss functions across three deep learning networks, achieving a mean Dice score of 83.12% [82.47%, 83.81%] with a 95% bootstrap confidence interval. In a low-percentage training scenario (10% of the training data), our approach showed an 8.32% improvement in Dice score compared to the baseline. Additionally, we evaluated the method in a clinical setting with 47 subjects and 501 images, achieving a Dice score of 64.44% [63.22%, 65.62%]. We did observe some bias when applying the model in clinical settings. These results demonstrate that the proposed method is accurate, and our code along with the trained model is publicly available.
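Results such as "83.12% [82.47%, 83.81%]" pair a mean Dice score with a bootstrap interval. A minimal sketch of both steps follows: Dice for one binary mask, then a percentile bootstrap over per-image scores. The masks and scores are hypothetical, the helper names are illustrative, and the paper's own protocol (for instance, whether it resamples images or subjects, or how it averages over regions) is not stated in the abstract.

```python
import random


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical flattened masks for a single region of a single image.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
example = dice_score(pred, target)  # 2 * 2 / (3 + 3)

# Hypothetical per-image Dice scores from a test set.
per_image = [0.81, 0.84, 0.79, 0.88, 0.83, 0.85, 0.80, 0.86]
low, high = bootstrap_ci(per_image)
```

When several images come from the same subject, resampling whole subjects rather than individual images usually gives a more honest interval, since images from one person are not independent.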
ROMar 31
ICAT: Incident-Case-Grounded Adaptive Testing for Physical-Risk Prediction in Embodied World ModelsZhenglin Lai, Sirui Huang, Yuteng Li et al.
Video-generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving/composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT-based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
AIDec 10, 2025
Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic SearchJunkai Ji, Zhangfan Yang, Dong Xu et al.
Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.
IVMay 8, 2025Code
ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain NormalizationChenxi Zhao, Jianqiang Li, Qing Zhao et al.
Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field.
CLOct 14, 2024Code
MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social MediaWei Zhai, Nan Bai, Qing Zhao et al.
As the prevalence of mental health challenges rises, social media has emerged as a key platform for individuals to express their emotions. Deep learning tends to be a promising solution for analyzing mental health on social media. However, black box models are often inflexible when switching between tasks, and their results typically lack explanations. With the rise of large language models (LLMs), their flexibility has introduced new approaches to the field. Also, due to their generative nature, they can be prompted to explain decision-making processes. However, their performance on complex psychological analysis still lags behind deep learning. In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions (C-IMHI) dataset, consisting of 9K samples, which has been quality-controlled and manually validated. We also propose MentalGLM series models, the first open-source LLMs designed for explainable mental health analysis targeting Chinese social media, trained on a corpus of 50K instructions. The proposed models were evaluated on three downstream tasks and achieved better or comparable performance compared to deep learning models, generalized LLMs, and task fine-tuned LLMs. We validated a portion of the generated decision explanations with experts, showing promising results. We also evaluated the proposed models on a clinical dataset, where they outperformed other LLMs, indicating their potential applicability in the clinical field. Our models show strong performance, validated across tasks and perspectives. The decision explanations enhance usability and facilitate better understanding and practical application of the models. Both the constructed dataset and the models are publicly available via: https://github.com/zwzzzQAQ/MentalGLM.
LGMay 8, 2021Code
Class-Incremental Learning for Wireless Device Identification in IoTYongxin Liu, Jian Wang, Jianqiang Li et al.
Deep Learning (DL) has been utilized pervasively in the Internet of Things (IoT). One typical application of DL in IoT is device identification from wireless signals, namely Non-cryptographic Device Identification (NDI). However, learning components in NDI systems have to evolve to adapt to operational variations; such a paradigm is termed Incremental Learning (IL). Various IL algorithms have been proposed, and many of them require dedicated space to store the increasing amount of historical data; therefore, they are not suitable for IoT or mobile applications. However, conventional IL schemes cannot provide satisfactory performance when historical data are not available. In this paper, we address the IL problem in NDI from a new perspective. First, we provide a new metric to measure the degree of topological maturity of DNN models from the degree of conflict of class-specific fingerprints. We discover that an important cause of performance degradation in IL-enabled NDI is the conflict of devices' fingerprints. Second, we show that the conventional IL schemes can lead to low topological maturity of DNN models in NDI systems. Third, we propose a new Channel Separation Enabled Incremental Learning (CSIL) scheme without using historical data, in which our strategy can automatically separate devices' fingerprints in different learning stages and avoid potential conflict. Finally, we evaluated the effectiveness of the proposed framework using real data from ADS-B (Automatic Dependent Surveillance-Broadcast), an application of IoT in aviation. The proposed framework has the potential to be applied to accurate identification of IoT devices in a variety of IoT applications and services. Data and code are available at IEEE Dataport (DOI: 10.21227/1bxc-ke87) and \url{https://github.com/pcwhy/CSIL}.
NIApr 8, 2021Code
Zero-bias Deep Learning Enabled Quick and Reliable Abnormality Detection in IoTYongxin Liu, Jian Wang, Jianqiang Li et al.
Abnormality detection is essential to the performance of safety-critical and latency-constrained systems. However, as systems are becoming increasingly complicated with a large quantity of heterogeneous data, conventional statistical change point detection methods are becoming less effective and efficient. Although Deep Learning (DL) and Deep Neural Networks (DNNs) are increasingly employed to handle heterogeneous data, they still lack theoretically assured performance and explainability. This paper integrates zero-bias DNN and Quickest Event Detection algorithms to provide a holistic framework for quick and reliable detection of both abnormalities and time-dependent abnormal events in the Internet of Things (IoT). We first use the zero-bias dense layer to increase the explainability of DNN. We provide a solution to convert zero-bias DNN classifiers into performance-assured binary abnormality detectors. Using the converted abnormality detector, we then present a sequential quickest detection scheme that provides the theoretically assured lowest abnormal event detection delay under false alarm constraints. Finally, we demonstrate the effectiveness of the framework using both massive signal records from real-world aviation communication systems and simulated data. Code and data for our work are available at \url{https://github.com/pcwhy/AbnormalityDetectionInZbDNN}.
CLNov 24, 2023
CMed-GPT: Prompt Tuning for Entity-Aware Chinese Medical Dialogue GenerationZhijie Qu, Juan Li, Zerui Ma et al.
Medical dialogue generation relies on natural language generation techniques to enable online medical consultations. Recently, the widespread adoption of large-scale models in the field of natural language processing has facilitated rapid advancements in this technology. Existing medical dialogue models are mostly based on BERT and pre-trained on English corpora, but there is a lack of high-performing models for the task of Chinese medical dialogue generation. To solve this problem, this paper proposes CMed-GPT, a GPT pre-trained language model based on Chinese medical domain text. The model is available in two versions, namely, base and large, with corresponding perplexity values of 8.64 and 8.01. Additionally, we incorporate lexical and entity embeddings into the dialogue text in a uniform manner to meet the requirements of downstream dialogue generation tasks. By applying both fine-tuning and p-tuning to CMed-GPT, we lowered the PPL from 8.44 to 7.35. This study not only confirms the exceptional performance of the CMed-GPT model in generating Chinese biomedical text but also highlights the advantages of p-tuning over traditional fine-tuning with prefix prompts. Furthermore, we validate the significance of incorporating external information in medical dialogue generation, which enhances the quality of dialogue generation.
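For readers unfamiliar with the perplexity (PPL) figures quoted above, the following is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. The function name and the input format (natural-log probabilities of each reference token) are assumptions for illustration; the abstract does not describe the authors' tokenization or evaluation setup.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference text."""
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that assigns probability 1/8 to every token
# is as uncertain as a uniform choice among 8 options.
uniform_8 = [math.log(1 / 8)] * 5
ppl = perplexity(uniform_8)
```

Lower values mean the model assigns higher probability to the reference text, which is why a drop in PPL is reported as an improvement.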
CLMay 7, 2024
Fine-grained Speech Sentiment Analysis in Chinese Psychological Support Hotlines Based on Large-scale Pre-trained ModelZhonglong Chen, Changwei Song, Yining Chen et al.
Suicide and suicidal behaviors remain significant challenges for public policy and healthcare. In response, psychological support hotlines have been established worldwide to provide immediate help to individuals in mental crises. The effectiveness of these hotlines largely depends on accurately identifying callers' emotional states, particularly underlying negative emotions indicative of increased suicide risk. However, the high demand for psychological interventions often results in a shortage of professional operators, highlighting the need for an effective speech emotion recognition model. This model would automatically detect and analyze callers' emotions, facilitating integration into hotline services. Additionally, it would enable large-scale data analysis of psychological support hotline interactions to explore psychological phenomena and behaviors across populations. Our study utilizes data from the Beijing psychological support hotline, the largest suicide hotline in China. We analyzed speech data from 105 callers containing 20,630 segments and categorized them into 11 types of negative emotions. We developed a negative emotion recognition model and a fine-grained multi-label classification model using a large-scale pre-trained model. Our experiments indicate that the negative emotion recognition model achieves a maximum F1-score of 76.96%. However, it shows limited efficacy in the fine-grained multi-label classification task, with the best model achieving only a 41.74% weighted F1-score. We conducted an error analysis for this task, discussed potential future improvements, and considered the clinical application possibilities of our study. All the code is publicly available.
CLApr 17, 2024
AI-Enhanced Cognitive Behavioral Therapy: Deep Learning and Large Language Models for Extracting Cognitive Pathways from Social Media TextsMeng Jiang, Yi Jing Yu, Qing Zhao et al.
Cognitive Behavioral Therapy (CBT) is an effective technique for addressing the irrational thoughts stemming from mental illnesses, but it necessitates precise identification of cognitive pathways to be successfully implemented in patient care. In current society, individuals frequently express negative emotions on social media on specific topics, often exhibiting cognitive distortions, including suicidal behaviors in extreme cases. Yet, there is a notable absence of methodologies for analyzing cognitive pathways that could aid psychotherapists in conducting effective interventions online. In this study, we gathered data from social media and established the task of extracting cognitive pathways, annotating the data based on a cognitive theoretical framework. We initially categorized the task of extracting cognitive pathways as a hierarchical text classification with four main categories and nineteen subcategories. Following this, we structured a text summarization task to help psychotherapists quickly grasp the essential information. Our experiments evaluate the performance of deep learning and large language models (LLMs) on these tasks. The results demonstrate that our deep learning method achieved a micro-F1 score of 62.34% in the hierarchical text classification task. Meanwhile, in the text summarization task, GPT-4 attained a Rouge-1 score of 54.92 and a Rouge-2 score of 30.86, surpassing the experimental deep learning model's performance. However, it may suffer from hallucination. We have made all models and code publicly available to support further research in this field.
CVMay 23, 2025
BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision ModelsDingqiang Ye, Chao Fan, Zhanbo Huang et al.
Large vision model (LVM)-based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of the LVM itself, particularly the rich, distinct representations across its multiple layers. To adequately unlock LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM's intermediate layers offer complementary properties across tasks, and that integrating them yields an impressive improvement even without rich, well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CASIA-B*, SUSTech1K, and CCGR_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.
RODec 7, 2024
Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior LearningRunhao Zeng, Dingjie Zhou, Qiwei Liang et al.
Learning behavior in legged robots presents a significant challenge due to its inherent instability and complex constraints. Recent research has proposed the use of a large language model (LLM) to generate reward functions in reinforcement learning, thereby replacing the need for manually designed rewards by experts. However, this approach, which relies on textual descriptions to define learning objectives, fails to achieve controllable and precise behavior learning with clear directionality. In this paper, we introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned. Specifically, we first process videos containing the target behaviors, converting the motion information of individuals in the videos into keypoint trajectories represented as coordinates through a video2text transforming module. These trajectories are then fed into an LLM to generate the reward function, which in turn is used to train the policy. To enhance the quality of the reward function, we develop a video-assisted iterative reward refinement scheme that visually assesses the learned behaviors and provides textual feedback to the LLM. This feedback guides the LLM to continually refine the reward function, ultimately facilitating more efficient behavior learning. Experimental results on tasks involving bipedal and quadrupedal robot motion control demonstrate that our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score. More importantly, by switching video inputs, we find our method can rapidly learn diverse motion behaviors such as walking and running.
ROApr 3
Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped RobotsJunyang Liang, Yuxuan Liu, Yabin Chang et al.
Achieving quadruped robot locomotion across diverse and dynamic terrains presents significant challenges, primarily due to the discrepancies between simulation environments and real-world conditions. Traditional sim-to-real transfer methods often rely on manual feature design or costly real-world fine-tuning. To address these limitations, this paper proposes the DreamTIP framework, which incorporates Task-Invariant Properties learning within the Dreamer world model architecture to enhance sim-to-real transfer capabilities. Guided by large language models, DreamTIP identifies and leverages Task-Invariant Properties, such as contact stability and terrain clearance, which exhibit robustness to dynamic variations and strong transferability across tasks. These properties are integrated into the world model as auxiliary prediction targets, enabling the policy to learn representations that are insensitive to underlying dynamic changes. Furthermore, an efficient adaptation strategy is designed, employing a mixed replay buffer and regularization constraints to rapidly calibrate to real-world dynamics while effectively mitigating representation collapse and catastrophic forgetting. Extensive experiments on complex terrains, including Stair, Climb, Tilt, and Crawl, demonstrate that DreamTIP significantly outperforms state-of-the-art baselines in both simulated and real-world environments. Our method achieves an average performance improvement of 28.1% across eight distinct simulated transfer tasks. In the real-world Climb task, the baseline method achieved only a 10% success rate, whereas our method attained a 100% success rate. These results indicate that incorporating Task-Invariant Properties into Dreamer learning offers a novel solution for achieving robust and transferable robot locomotion.
LGJun 20, 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert IdentificationZhenglin Lai, Mengyao Liao, Bingzhe Wu et al.
Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges insufficiently addressed by techniques developed for dense models. In this work, the MoE-specific safety risk of positional vulnerability (that safety-aligned behaviors rely on specific expert modules) is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated. On Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48x128=6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations (the HRCG, the union of HCDG and HRCG, and all experts), and the resulting updates are composed through negative weight merging targeted at the HRCG, leading to improved refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.
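The expert-masking intervention can be pictured with a small sketch of top-k routing in which selected experts are excluded before the top-k choice. This is a generic illustration, not the SAFEx procedure: `top_k_experts`, the plain-list router logits, and the toy values are hypothetical, and only the layer count, experts per layer, and top-8 routing come from the abstract. With those figures, 12 masked experts are about 0.2% of the 6,144 total.

```python
def top_k_experts(router_logits, k, masked=frozenset()):
    """Indices of the k highest-scoring experts, skipping any masked expert."""
    candidates = [i for i in range(len(router_logits)) if i not in masked]
    # Highest router logit first; masked experts can never be selected.
    candidates.sort(key=lambda i: router_logits[i], reverse=True)
    return candidates[:k]

# Configuration reported in the abstract for Qwen3-30B-A3B.
NUM_LAYERS, EXPERTS_PER_LAYER, TOP_K = 48, 128, 8
total_experts = NUM_LAYERS * EXPERTS_PER_LAYER
masked_fraction = 12 / total_experts

# Toy router output for a 4-expert layer (hypothetical values).
logits = [0.1, 0.9, 0.5, 0.7]
baseline = top_k_experts(logits, k=2)
after_masking = top_k_experts(logits, k=2, masked={1})
```

In the toy layer, masking expert 1 forces the router to fall back to the next-best expert, which is the mechanism by which disabling a handful of experts can change model behavior.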
LGJan 21, 2025
Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial AnalysisHongjun Liu, Changwei Song, Jiaqi Qiang et al.
Cushing's syndrome is a condition caused by excessive glucocorticoid secretion from the adrenal cortex, often manifesting with moon facies and plethora, making facial data crucial for diagnosis. Previous studies have used pre-trained convolutional neural networks (CNNs) for diagnosing Cushing's syndrome using frontal facial images. However, CNNs are better at capturing local features, while Cushing's syndrome often presents with global facial features. Transformer-based models like ViT and SWIN, which utilize self-attention mechanisms, can better capture long-range dependencies and global features. Recently, DINOv2, a foundation model based on visual Transformers, has gained interest. This study compares the performance of various pre-trained models, including CNNs, Transformer-based models, and DINOv2, in diagnosing Cushing's syndrome. We also analyze gender bias and the impact of freezing mechanisms on DINOv2. Our results show that Transformer-based models and DINOv2 outperformed CNNs, with ViT achieving the highest F1 score of 85.74%. Both the pre-trained model and DINOv2 had higher accuracy for female samples. DINOv2 also showed improved performance when freezing parameters. In conclusion, Transformer-based models and DINOv2 are effective for Cushing's syndrome classification.
CLApr 19, 2024
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media AnalysisHongzhi Qi, Hanfei Liu, Jianqiang Li et al.
On social media, users frequently express personal emotions, a subset of which may indicate potential suicidal tendencies. The implicit and varied forms of expression in internet language complicate accurate and rapid identification of suicidal intent on social media, thus creating challenges for timely intervention efforts. The development of deep learning models for suicide risk detection is a promising solution, but there is a notable lack of relevant datasets, especially in the Chinese context. To address this gap, this study presents a Chinese social media dataset designed for fine-grained suicide risk classification, focusing on indicators such as expressions of suicide intent, methods of suicide, and urgency of timing. Seven pre-trained models were evaluated in two tasks: high and low suicide risk, and fine-grained suicide risk classification on a level of 0 to 10. In our experiments, deep learning models show good performance in distinguishing between high and low suicide risk, with the best model achieving an F1 score of 88.39%. However, the results for fine-grained suicide risk classification were still unsatisfactory, with a weighted F1 score of 50.89%. To address the issues of data imbalance and limited dataset size, we investigated both traditional and advanced, large language model based data augmentation techniques, demonstrating that data augmentation can enhance model performance by up to 4.65 percentage points in F1-score. Notably, the Chinese MentalBERT model, which was pre-trained on psychological domain data, shows superior performance in both tasks. This study provides valuable insights for automatic identification of suicidal individuals, facilitating timely psychological intervention on social media platforms. The source code and data are publicly available.
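The weighted F1 score reported for the fine-grained task can be illustrated with a minimal sketch: per-class F1 is computed and then averaged with weights proportional to each class's number of true samples. This follows the common definition of support-weighted F1 and is not taken from the authors' code; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least n > 0 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)

# Toy example with an imbalanced label distribution (hypothetical data).
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, frequent risk levels dominate the score, which is one reason imbalanced fine-grained tasks are often reported this way.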
RODec 18, 2024
Efficient Language-instructed Skill Acquisition via Reward-Policy Co-EvolutionChangxin Huang, Yanbin Chang, Junfan Lin et al.
The ability to autonomously explore and resolve tasks with minimal human guidance is crucial for the self-development of embodied intelligence. Although reinforcement learning methods can largely ease human effort, it's challenging to design reward functions for real-world tasks, especially for high-dimensional robotic control, due to complex relationships among joints and tasks. Recent advancements in large language models (LLMs) enable automatic reward function design. However, existing approaches evaluate reward functions by re-training policies from scratch, placing an undue burden on the reward function and expecting it to be effective throughout the whole policy improvement process. We argue for a more practical strategy in robotic autonomy, focusing on refining existing policies with policy-dependent reward functions rather than a universal one. To this end, we propose a novel reward-policy co-evolution framework where the reward function and the learned policy benefit from each other's progressive on-the-fly improvements, resulting in more efficient and higher-performing skill acquisition. Specifically, the reward evolution process translates the robot's previous best reward function, descriptions of tasks and environment into text inputs. These inputs are used to query LLMs to generate a dynamic amount of reward function candidates, ensuring continuous improvement at each round of evolution. For policy evolution, our method generates new policy populations by hybridizing historically optimal and random policies. Through an improved Bayesian optimization, our approach efficiently and robustly identifies the most capable and plastic reward-policy combination, which then proceeds to the next round of co-evolution. Despite using less data, our approach demonstrates an average normalized improvement of 95.3% across various high-dimensional robotic skill learning tasks.
LGNov 11, 2024
Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screeningZhangfan Yang, Junkai Ji, Shan He et al.
Molecular docking is a crucial step in drug development, which enables the virtual screening of compound libraries to identify potential ligands that target proteins of interest. However, the computational complexity of traditional docking models increases as the size of the compound library increases. Recently, deep learning algorithms can provide data-driven research and development models to increase the speed of the docking process. Unfortunately, few models can achieve superior screening performance compared to that of traditional models. Therefore, a novel deep learning-based docking approach named Dockformer is introduced in this study. Dockformer leverages multimodal information to capture the geometric topology and structural knowledge of molecules and can directly generate binding conformations with the corresponding confidence measures in an end-to-end manner. The experimental results show that Dockformer achieves success rates of 90.53% and 82.71% on the PDBbind core set and PoseBusters benchmarks, respectively, and more than a 100-fold increase in the inference process speed, outperforming almost all state-of-the-art docking methods. In addition, the ability of Dockformer to identify the main protease inhibitors of coronaviruses is demonstrated in a real-world virtual screening scenario. Considering its high docking accuracy and screening efficiency, Dockformer can be regarded as a powerful and robust tool in the field of drug design.
AINov 24, 2025
UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World ModelChangxin Huang, Lv Tang, Zhaohuan Zhan et al.
Vision-and-Language Navigation (VLN), which requires agents to autonomously navigate complex environments via visual images and natural language instructions, remains highly challenging. Recent research on enhancing language-guided navigation reasoning using pre-trained large language models (LLMs) has shown promising prospects. However, the reasoning of such methods is limited to the linguistic modality, lacking visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions as inputs to jointly predict subsequent visual states, enabling cross-modal reasoning. Via a Hierarchical Prediction-Feedback (HPN) mechanism, MWM collaborates with navigation policies: the first layer generates actions using current vision-and-language features; MWM then infers post-action visual states to guide the second layer's fine-grained decisions. This forms a dynamic bidirectional promotion mechanism where MWM reasoning optimizes navigation policies, while policy decisions feed back to improve MWM's reasoning accuracy. Experiments on R2R and REVERIE datasets show UNeMo outperforms state-of-the-art methods by 2.1% and 0.7% in navigation accuracy for unseen scenes, validating its effectiveness.
AISep 30, 2025
AuditAgent: Expert-Guided Multi-Agent Reasoning for Cross-Document Fraudulent Evidence DiscoverySongran Bai, Bingzhe Wu, Yiwei Zhang et al.
Financial fraud detection in real-world scenarios presents significant challenges due to the subtlety and dispersion of evidence across complex, multi-year financial disclosures. In this work, we introduce a novel multi-agent reasoning framework, AuditAgent, enhanced with auditing domain expertise, for fine-grained evidence chain localization in financial fraud cases. Leveraging an expert-annotated dataset constructed from enforcement documents and financial reports released by the China Securities Regulatory Commission, our approach integrates subject-level risk priors, a hybrid retrieval strategy, and specialized agent modules to efficiently identify and aggregate cross-report evidence. Extensive experiments demonstrate that our method substantially outperforms the General-Purpose Agent paradigm in both recall and interpretability, establishing a new benchmark for automated, transparent financial forensics. Our results highlight the value of domain-specific reasoning and dataset construction for advancing robust financial fraud detection in practical, real-world regulatory applications.
QUANT-PHSep 30, 2025
Layerwise Federated Learning for Heterogeneous Quantum Clients using QuorusJason Han, Nicholas S. DiBrita, Daniel Leeds et al.
Quantum machine learning (QML) holds the promise to solve classically intractable problems, but, as critical data can be fragmented across private clients, there is a need for distributed QML in a quantum federated learning (QFL) format. However, the quantum computers that different clients have access to can be error-prone and have heterogeneous error properties, requiring them to run circuits of different depths. We propose a novel solution to this QFL problem, Quorus, that utilizes a layerwise loss function for effective training of varying-depth quantum models, which allows clients to choose models for high-fidelity output based on their individual capacity. Quorus also presents various model designs based on client needs that optimize for shot budget, qubit count, midcircuit measurement, and optimization space. Our simulation and real-hardware results show the promise of Quorus: it increases the magnitude of gradients of higher depth clients and improves testing accuracy by 12.4% on average over the state-of-the-art.
NIAug 25, 2025
Automating Conflict-Aware ACL Configurations with Natural Language IntentsWenlong Ding, Jianqiang Li, Zhixiong Niu et al.
ACL configuration is essential for managing network flow reachability, yet its complexity grows significantly with topologies and pre-existing rules. To carry out ACL configuration, the operator needs to (1) understand the new configuration policies or intents and translate them into concrete ACL rules, (2) check and resolve any conflicts between the new and existing rules, and (3) deploy them across the network. Existing systems rely heavily on manual effort for these tasks, especially the first two, which are tedious, error-prone, and impractical to scale. We propose Xumi to tackle this problem. Leveraging LLMs with domain knowledge of the target network, Xumi automatically and accurately translates the natural language intents into complete ACL rules to reduce operators' manual effort. Xumi then detects all potential conflicts between new and existing rules, generates resolved intents for deployment with operators' guidance, and finally identifies the best deployment plan that minimizes rule additions while satisfying all intents. Evaluation shows that Xumi accelerates the entire configuration pipeline by over 10x compared to current practices, addresses O(100) conflicting ACLs, and reduces rule additions by ~40% in a modern cloud network.
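To make step (2) above concrete, here is a minimal sketch of one common notion of an ACL conflict: two rules whose source and destination prefixes both overlap but whose actions differ. The `AclRule` type and the `rules_conflict` and `find_conflicts` helpers are hypothetical names chosen for illustration; this is not Xumi's conflict-detection method, which the abstract does not specify.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class AclRule:
    """Hypothetical, simplified ACL rule: prefixes only, no ports or protocols."""
    src: str      # source prefix, e.g. "10.0.0.0/8"
    dst: str      # destination prefix
    action: str   # "permit" or "deny"


def rules_conflict(a: AclRule, b: AclRule) -> bool:
    """Two rules conflict if they can match the same flow but disagree on the action."""
    same_flow_possible = (
        ip_network(a.src).overlaps(ip_network(b.src))
        and ip_network(a.dst).overlaps(ip_network(b.dst))
    )
    return same_flow_possible and a.action != b.action


def find_conflicts(new_rule: AclRule, existing: list) -> list:
    """Return every existing rule that conflicts with the new rule."""
    return [rule for rule in existing if rules_conflict(new_rule, rule)]


existing_rules = [
    AclRule("10.0.0.0/8", "192.168.1.0/24", "deny"),
    AclRule("172.16.0.0/12", "192.168.1.0/24", "deny"),
]
new_rule = AclRule("10.1.0.0/16", "192.168.1.0/24", "permit")
conflicts = find_conflicts(new_rule, existing_rules)
```

Here only the first existing rule is reported, because its source prefix contains the new rule's source prefix and the actions disagree. A production checker would also need to account for ports, protocols, and rule priority.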
ROMay 5, 2025
Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill LearningChangxin Huang, Junyang Liang, Yanbin Chang et al.
Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot's learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.
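The idea of weighting one value branch per reward component can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `example_weight_rule` (a softmax that emphasizes the currently weakest components) is a stand-in for the LLM-designed rules, whose actual form the abstract does not give.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def example_weight_rule(branch_performance):
    """Hypothetical rule: give more weight to reward components that perform worst.

    In the described framework, rules of this kind would be generated and
    selected by an LLM; this fixed rule is an assumption for illustration.
    """
    return softmax([-p for p in branch_performance])


def combine_advantages(branch_advantages, weights):
    """Weighted sum of per-branch advantage estimates, used in place of a plain sum."""
    return sum(w * a for w, a in zip(weights, branch_advantages))


# Two reward components; the first currently performs worse (0.2 vs 0.9).
weights = example_weight_rule([0.2, 0.9])
advantage = combine_advantages([1.0, 3.0], weights)
```

Summing all components indiscriminately corresponds to fixing every weight to the same value, so the weighted form contains the baseline described in the abstract as a special case.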
IVApr 22, 2025
Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSegJingchen Zou, Jianqiang Li, Gabriel Jimenez et al.
The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as the Dice score and distance-based metrics like HD95, demonstrate the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available
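For readers unfamiliar with the quantities reported above, the following sketch shows how a Dice score is computed for binary masks, and how estimated scores could be compared against real ones using Pearson correlation and mean absolute error (MAE). It does not implement SPE itself; the function names and the toy numbers are illustrative assumptions, and masks are represented as flat lists of 0/1 values for simplicity.

```python
from math import sqrt


def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(estimated, actual):
    """Average absolute gap between estimated and real metric values."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values (not from the paper): real Dice scores vs. hypothetical estimates.
real_dice = [0.90, 0.75, 0.60, 0.85]
estimated_dice = [0.88, 0.78, 0.63, 0.84]
mae = mean_absolute_error(estimated_dice, real_dice)
corr = pearson_correlation(estimated_dice, real_dice)
```

A high correlation together with a low MAE indicates that the estimates track the real scores both in ranking and in absolute value.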
QUANT-PHOct 31, 2021
Limitations of the Macaulay matrix approach for using the HHL algorithm to solve multivariate polynomial systemsJintai Ding, Vlad Gheorghiu, András Gilyén et al.
Recently Chen and Gao~\cite{ChenGao2017} proposed a new quantum algorithm for Boolean polynomial system solving, motivated by the cryptanalysis of some post-quantum cryptosystems. The key idea of their approach is to apply a Quantum Linear System (QLS) algorithm to a Macaulay linear system over $\mathbb{C}$, which is derived from the Boolean polynomial system. The efficiency of their algorithm depends on the condition number of the Macaulay matrix. In this paper, we give a strong lower bound on the condition number as a function of the Hamming weight of the Boolean solution, and show that in many (if not all) cases a Grover-based exhaustive search algorithm outperforms their algorithm. Then, we improve upon Chen and Gao's algorithm by introducing the Boolean Macaulay linear system over $\mathbb{C}$ by reducing the original Macaulay linear system. This improved algorithm could potentially significantly outperform the brute-force algorithm, when the Hamming weight of the solution is logarithmic in the number of Boolean variables. Furthermore, we provide a simple and more elementary proof of correctness for our improved algorithm using a reduction employing the Valiant-Vazirani affine hashing method, and also extend the result to polynomial systems over $\mathbb{F}_q$ improving on subsequent work by Chen, Gao and Yuan \cite{ChenGao2018}. We also suggest a new approach for extracting the solution of the Boolean polynomial system via a generalization of the quantum coupon collector problem \cite{arunachalam2020QuantumCouponCollector}.
CRJan 26, 2021
Differential Privacy for Industrial Internet of Things: Opportunities, Applications and ChallengesBin Jiang, Jianqiang Li, Guanghui Yue et al.
The development of the Internet of Things (IoT) brings new changes to various fields. In particular, the industrial Internet of Things (IIoT) is promoting a new round of industrial revolution. With more applications of IIoT, privacy protection issues are emerging. Specifically, some common algorithms in IIoT technology, such as deep models, rely strongly on data collection, which leads to the risk of privacy disclosure. Recently, differential privacy has been used to protect user-terminal privacy in IIoT, so in-depth research on this topic is necessary. In this paper, we conduct a comprehensive survey on the opportunities, applications, and challenges of differential privacy in IIoT. We first review related papers on IIoT and privacy protection, respectively. Then we focus on the metrics of industrial data privacy and analyze the contradiction between data utilization for deep models and individual privacy protection. Several valuable problems are summarized and new research ideas are put forward. In conclusion, this survey aims to provide a comprehensive summary and lay a foundation for follow-up research on industrial differential privacy.
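As background for the survey's topic, the sketch below shows the classic Laplace mechanism, one standard way to achieve epsilon-differential privacy for a numeric query: noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to the true answer. The function names and the sensor-count scenario are illustrative assumptions, not taken from the survey.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    `sensitivity` is how much one individual's data can change the count
    (1 for a simple counting query). Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: number of machines reporting a fault, released privately.
rng = random.Random(0)
noisy_fault_count = private_count(true_count=100, epsilon=1.0, rng=rng)
```

The trade-off the abstract mentions between data utility and individual privacy shows up directly in `epsilon`: lowering it strengthens the privacy guarantee but makes each released value less accurate.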
CRJan 25, 2021
Machine Learning for the Detection and Identification of Internet of Things (IoT) Devices: A SurveyYongxin Liu, Jian Wang, Jianqiang Li et al.
The Internet of Things (IoT) is becoming an indispensable part of everyday life, enabling a variety of emerging services and applications. However, the presence of rogue IoT devices has exposed the IoT to untold risks with severe consequences. The first step in securing the IoT is detecting rogue IoT devices and identifying legitimate ones. Conventional approaches use cryptographic mechanisms to authenticate and verify legitimate devices' identities. However, cryptographic protocols are not available in many systems. Meanwhile, these methods are less effective when legitimate devices can be exploited or encryption keys are disclosed. Therefore, non-cryptographic IoT device identification and rogue device detection become efficient solutions to secure existing systems and will provide additional protection to systems with cryptographic protocols. Non-cryptographic approaches require more effort and are not yet adequately investigated. In this paper, we provide a comprehensive survey on machine learning technologies for the identification of IoT devices along with the detection of compromised or falsified ones from the viewpoint of passive surveillance agents or network operators. We classify the IoT device identification and detection into four categories: device-specific pattern recognition, Deep Learning enabled device identification, unsupervised device identification, and abnormal device detection. Meanwhile, we discuss various ML-related enabling technologies for this purpose. These enabling technologies include learning algorithms, feature engineering on network traffic traces and wireless signals, continual learning, and abnormality detection.
LGAug 27, 2020
Zero-Bias Deep Learning for Accurate Identification of Internet of Things (IoT) DevicesYongxin Liu, Jian Wang, Jianqiang Li et al.
The Internet of Things (IoT) provides applications and services that would otherwise not be possible. However, the open nature of IoT makes it vulnerable to cybersecurity threats, especially identity spoofing attacks, in which an adversary passively listens to existing radio communications and then mimics the identity of legitimate devices to conduct malicious activities. Existing solutions employ cryptographic signatures to verify the trustworthiness of received information. In prevalent IoT, secret keys for cryptography can potentially be disclosed, disabling the verification mechanism. Non-cryptographic device verification is needed to ensure trustworthy IoT. In this paper, we propose an enhanced deep learning framework for IoT device identification using physical layer signals. Specifically, we enable our framework to report unseen IoT devices and introduce the zero-bias layer to deep neural networks to increase robustness and interpretability. We have evaluated the effectiveness of the proposed framework using real data from ADS-B (Automatic Dependent Surveillance-Broadcast), an application of IoT in aviation. The proposed framework has the potential to be applied to accurate identification of IoT devices in a variety of IoT applications and services. Code and data are available in IEEE Dataport.