AIJun 16, 2022Code
Identifying Electrocardiogram Abnormalities Using a Handcrafted-Rule-Enhanced Neural NetworkYuexin Bian, Jintai Chen, Xiaojun Chen et al.
A large number of people suffer from life-threatening cardiac abnormalities, and electrocardiogram (ECG) analysis is beneficial to determining whether an individual is at risk of such abnormalities. Automatic ECG classification methods, especially the deep learning based ones, have been proposed to detect cardiac abnormalities using ECG records, showing good potential to improve clinical diagnosis and help early prevention of cardiovascular diseases. However, the predictions of the known neural networks still do not satisfactorily meet the needs of clinicians, and this phenomenon suggests that some information used in clinical diagnosis may not be well captured and utilized by these methods. In this paper, we introduce some rules into convolutional neural networks, which help present clinical knowledge to deep learning based ECG analysis, in order to improve automated ECG diagnosis performance. Specifically, we propose a Handcrafted-Rule-enhanced Neural Network (called HRNN) for ECG classification with standard 12-lead ECG input, which consists of a rule inference module and a deep learning module. Experiments on two large-scale public ECG datasets show that our new approach considerably outperforms existing state-of-the-art methods. Further, our proposed approach not only can improve the diagnosis performance, but also can assist in detecting mislabelled ECG samples. Our codes are available at https://github.com/alwaysbyx/ecg_processing.
CVFeb 8, 2023Code
Cross-Layer Retrospective Retrieving via Layer AttentionYanwen Fang, Yuxi Cai, Jintai Chen et al.
More and more evidence has shown that strengthening layer interactions can enhance the representation power of a deep neural network, while self-attention excels at learning interdependencies by retrieving query-activated information. Motivated by this, we devise a cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), that sends a query representation of the current layer to all previous layers to retrieve query-related information from different levels of receptive fields. A light-weighted version of MRLA is also proposed to reduce the quadratic computation cost. The proposed layer attention mechanism can enrich the representation power of many state-of-the-art vision networks, including CNNs and vision transformers. Its effectiveness has been extensively evaluated in image classification, object detection and instance segmentation tasks, where improvements can be consistently observed. For example, our MRLA can improve 1.6% Top-1 accuracy on ResNet-50, while only introducing 0.16M parameters and 0.07B FLOPs. Surprisingly, it can boost the performances by a large margin of 3-4% box AP and mask AP in dense prediction tasks. Our code is available at https://github.com/joyfang1106/MRLA.
AINov 21, 2023
A Survey on Multimodal Large Language Models for Autonomous DrivingCan Cui, Yunsheng Ma, Xu Cao et al.
With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems benefiting from large models have the potential to equally perceive the real world, make decisions, and control tools as humans. In recent months, LLMs have shown widespread attention in autonomous driving and map systems. Despite its immense potential, there is still a lack of a comprehensive understanding of key challenges, opportunities, and future endeavors to apply in LLM driving systems. In this paper, we present a systematic investigation in this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the multimodal models development using LLMs, and the history of autonomous driving. Then, we overview existing MLLM tools for driving, transportation, and map systems together with existing datasets and benchmarks. Moreover, we summarized the works in The 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), which is the first workshop of its kind regarding LLMs in autonomous driving. To further promote the development of this field, we also discuss several important problems regarding using MLLMs in autonomous driving systems that need to be solved by both academia and industry.
AIJul 18, 2023Code
Ord2Seq: Regarding Ordinal Regression as Label Sequence PredictionJinhong Wang, Yi Cheng, Jintai Chen et al.
Ordinal regression refers to classifying object instances into ordinal categories. It has been widely studied in many scenarios, such as medical disease grading, movie rating, etc. Known methods focused only on learning inter-class ordinal relationships, but still incur limitations in distinguishing adjacent categories thus far. In this paper, we propose a simple sequence prediction framework for ordinal regression called Ord2Seq, which, for the first time, transforms each ordinal category label into a special label sequence and thus regards an ordinal regression task as a sequence prediction process. In this way, we decompose an ordinal regression task into a series of recursive binary classification steps, so as to subtly distinguish adjacent categories. Comprehensive experiments show the effectiveness of distinguishing adjacent categories for performance improvement and our new approach exceeds state-of-the-art performances in four different scenarios. Codes are available at https://github.com/wjh892521292/Ord2Seq.
IVApr 20, 2022
Fetal Brain Tissue Annotation and Segmentation Challenge ResultsKelly Payette, Hongwei Li, Priscille de Dumast et al.
In-utero fetal MRI is emerging as an important tool in the diagnosis and analysis of the developing human brain. Automatic segmentation of the developing fetal brain is a vital step in the quantitative analysis of prenatal neurodevelopment both in the research and clinical context. However, manual segmentation of cerebral structures is time-consuming and prone to error and inter-observer variability. Therefore, we organized the Fetal Tissue Annotation (FeTA) Challenge in 2021 in order to encourage the development of automatic segmentation algorithms on an international level. The challenge utilized FeTA Dataset, an open dataset of fetal brain MRI reconstructions segmented into seven different tissues (external cerebrospinal fluid, grey matter, white matter, ventricles, cerebellum, brainstem, deep grey matter). 20 international teams participated in this challenge, submitting a total of 21 algorithms for evaluation. In this paper, we provide a detailed analysis of the results from both a technical and clinical perspective. All participants relied on deep learning methods, mainly U-Nets, with some variability present in the network architecture, optimization, and image pre- and post-processing. The majority of teams used existing medical imaging deep learning frameworks. The main differences between the submissions were the fine tuning done during training, and the specific pre- and post-processing steps performed. The challenge results showed that almost all submissions performed similarly. Four of the top five teams used ensemble learning methods. However, one team's algorithm performed significantly superior to the other submissions, and consisted of an asymmetrical U-Net network architecture. This paper provides a first of its kind benchmark for future automatic multi-tissue segmentation algorithms for the developing human brain in utero.
LGJan 7, 2023
ExcelFormer: A neural network surpassing GBDTs on tabular dataJintai Chen, Jiahuan Yan, Qiyuan Chen et al.
Data organized in tabular format is ubiquitous in real-world applications, and users often craft tables with biased feature definitions and flexibly set prediction targets of their interests. Thus, a rapid development of a robust, effective, dataset-versatile, user-friendly tabular prediction approach is highly desired. While Gradient Boosting Decision Trees (GBDTs) and existing deep neural networks (DNNs) have been extensively utilized by professional users, they present several challenges for casual users, particularly: (i) the dilemma of model selection due to their different dataset preferences, and (ii) the need for heavy hyperparameter searching, failing which their performances are deemed inadequate. In this paper, we delve into this question: Can we develop a deep learning model that serves as a "sure bet" solution for a wide range of tabular prediction tasks, while also being user-friendly for casual users? We delve into three key drawbacks of deep tabular models, encompassing: (P1) lack of rotational variance property, (P2) large data demand, and (P3) over-smooth solution. We propose ExcelFormer, addressing these challenges through a semi-permeable attention module that effectively constrains the influence of less informative features to break the DNNs' rotational invariance property (for P1), data augmentation approaches tailored for tabular data (for P2), and attentive feedforward network to boost the model fitting capability (for P3). These designs collectively make ExcelFormer a "sure bet" solution for diverse tabular datasets. Extensive and stratified experiments conducted on real-world datasets demonstrate that our model outperforms previous approaches across diverse tabular data prediction tasks, and this framework can be friendly to casual users, offering ease of use without the heavy hyperparameter tuning.
CVJan 30Code
Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-TrainingAnglin Liu, Ruichao Chen, Yi Lu et al.
Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.
LGNov 12, 2022
Robust Training of Graph Neural Networks via Noise GovernanceSiyi Qian, Haochao Ying, Renjun Hu et al.
Graph Neural Networks (GNNs) have become widely-used models for semi-supervised learning. However, the robustness of GNNs in the presence of label noise remains a largely under-explored problem. In this paper, we consider an important yet challenging scenario where labels on nodes of graphs are not only noisy but also scarce. In this scenario, the performance of GNNs is prone to degrade due to label noise propagation and insufficient learning. To address these issues, we propose a novel RTGNN (Robust Training of Graph Neural Networks via Noise Governance) framework that achieves better robustness by learning to explicitly govern label noise. More specifically, we introduce self-reinforcement and consistency regularization as supplemental supervision. The self-reinforcement supervision is inspired by the memorization effects of deep neural networks and aims to correct noisy labels. Further, the consistency regularization prevents GNNs from overfitting to noisy labels via mimicry loss in both the inter-view and intra-view perspectives. To leverage such supervisions, we divide labels into clean and noisy types, rectify inaccurate labels, and further generate pseudo-labels on unlabeled nodes. Supervision for nodes with different types of labels is then chosen adaptively. This enables sufficient learning from clean labels while limiting the impact of noisy ones. We conduct extensive experiments to evaluate the effectiveness of our RTGNN framework, and the results validate its consistent superior performance over state-of-the-art methods with two types of label noises and various noise rates.
LGNov 30, 2022
T2G-Former: Organizing Tabular Features into Relation Graphs Promotes Heterogeneous Feature InteractionJiahuan Yan, Jintai Chen, Yixuan Wu et al.
Recent development of deep neural networks (DNNs) for tabular learning has largely benefited from the capability of DNNs for automatic feature interaction. However, the heterogeneity nature of tabular features makes such features relatively independent, and developing effective methods to promote tabular feature interaction still remains an open problem. In this paper, we propose a novel Graph Estimator, which automatically estimates the relations among tabular features and builds graphs by assigning edges between related features. Such relation graphs organize independent tabular features into a kind of graph data such that interaction of nodes (tabular features) can be conducted in an orderly fashion. Based on our proposed Graph Estimator, we present a bespoke Transformer network tailored for tabular learning, called T2G-Former, which processes tabular data by performing tabular feature interaction guided by the relation graphs. A specific Cross-level Readout collects salient features predicted by the layers in T2G-Former across different levels, and attains global semantics for final prediction. Comprehensive experiments show that our T2G-Former achieves superior performance among DNNs and is competitive with non-deep Gradient Boosted Decision Tree models.
LGJul 21, 2022
ME-GAN: Learning Panoptic Electrocardio Representations for Multi-view ECG Synthesis Conditioned on Heart DiseasesJintai Chen, Kuanlun Liao, Kun Wei et al.
Electrocardiogram (ECG) is a widely used non-invasive diagnostic tool for heart diseases. Many studies have devised ECG analysis models (e.g., classifiers) to assist diagnosis. As an upstream task, researches have built generative models to synthesize ECG data, which are beneficial to providing training samples, privacy protection, and annotation reduction. However, previous generative methods for ECG often neither synthesized multi-view data, nor dealt with heart disease conditions. In this paper, we propose a novel disease-aware generative adversarial network for multi-view ECG synthesis called ME-GAN, which attains panoptic electrocardio representations conditioned on heart diseases and projects the representations onto multiple standard views to yield ECG signals. Since ECG manifestations of heart diseases are often localized in specific waveforms, we propose a new "mixup normalization" to inject disease information precisely into suitable locations. In addition, we propose a view discriminator to revert disordered ECG views into a pre-determined order, supervising the generator to obtain ECG representing correct view characteristics. Besides, a new metric, rFID, is presented to assess the quality of the synthesized ECG signals. Comprehensive experiments verify that our ME-GAN performs well on multi-view ECG signal synthesis with trusty morbid manifestations.
CVSep 16, 2023
GCL: Gradient-Guided Contrastive Learning for Medical Image Segmentation with Multi-Perspective Meta LabelsYixuan Wu, Jintai Chen, Jiahuan Yan et al.
Since annotating medical images for segmentation tasks commonly incurs expensive costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited a great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable "semantic contradiction" when combining different meta labels. In this paper, we tackle the issue of "semantic contradiction" in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain a better high-level semantic recognition ability. Moreover, we emphasize that the fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.
CVAug 19, 2024Code
Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot LearningYuxia Geng, Runkai Zhu, Jiaoyan Chen et al.
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP's frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at:https://github.com/zhurunkai/DCDA.
AIMay 26
Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?Qingyuan Zeng, Ziyang Chen, Pengxiang Cai et al.
Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.
QUANT-PHAug 24, 2024
Quantum-machine-assisted Drug DiscoveryYidong Zhou, Jintai Chen, Jinglei Cheng et al.
Drug discovery is lengthy and expensive, with traditional computer-aided design facing limits. This paper examines integrating quantum computing across the drug development cycle to accelerate and enhance workflows and rigorous decision-making. It highlights quantum approaches for molecular simulation, drug-target interaction prediction, and optimizing clinical trials. Leveraging quantum capabilities could accelerate timelines and costs for bringing therapies to market, improving efficiency and ultimately benefiting public health.
CLNov 15, 2023
Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language ModelsWeize Liu, Guocong Li, Kai Zhang et al.
Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments.
LGMay 25
Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion ModelsQingyuan Zeng, Pengxiang Cai, Zixin Guan et al.
Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder than standard single-target generation because one candidate must satisfy two binding requirements while preserving drug-likeness and synthesizability. Existing dual-target generative methods typically introduce dual-target capability by either retraining the generator or intervening in the diffusion process during sampling. The former can be costly and difficult to stabilize when dual-target supervision is sparse, while the latter may be sensitive to denoising-time target balancing and competing update directions. These limitations motivate a generator-preserving alternative that keeps the pretrained prior intact: can dual-target candidates instead be recovered from the input space of a frozen single-target diffusion model, without modifying its parameters or denoising dynamics? We formulate this task as a constrained multi-objective optimization problem and propose REUSE, a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments show that, compared with methods that modify the diffusion process, REUSE consistently improves dual-target affinity and balance, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline while maintaining competitive molecular quality.
CLApr 7Code
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward ModelingQiyuan Chen, Hongsen Huang, Jiahe Chen et al.
Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
AIJul 27, 2024
Multi-Modal CLIP-Informed Protein EditingMingze Yin, Hanjing Zhou, Yiheng Zhu et al.
Proteins govern most biological functions essential for life, but achieving controllable protein discovery and optimization remains challenging. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly conduct protein editing using biotext instructions, limiting their interactivity with human feedback. To fill these gaps, we propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises two stages: in the pretraining stage, contrastive learning aligns protein-biotext representations encoded by two large language models (LLMs), respectively. Subsequently, during the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences. Comprehensive experiments demonstrated the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability and antibody specific binding ability. And ProtET improves the state-of-the-art results by a large margin, leading to significant stability improvements of 16.67% and 16.90%. This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
LGJul 13, 2024
Team up GBDTs and DNNs: Advancing Efficient and Effective Tabular Prediction with Tree-hybrid MLPsJiahuan Yan, Jintai Chen, Qianxing Wang et al.
Tabular datasets play a crucial role in various applications. Thus, developing efficient, effective, and widely compatible prediction algorithms for tabular data is important. Currently, two prominent model types, Gradient Boosted Decision Trees (GBDTs) and Deep Neural Networks (DNNs), have demonstrated performance advantages on distinct tabular prediction tasks. However, selecting an effective model for a specific tabular dataset is challenging, often demanding time-consuming hyperparameter tuning. To address this model selection dilemma, this paper proposes a new framework that amalgamates the advantages of both GBDTs and DNNs, resulting in a DNN algorithm that is as efficient as GBDTs and is competitively effective regardless of dataset preferences for GBDTs or DNNs. Our idea is rooted in an observation that deep learning (DL) offers a larger parameter space that can represent a well-performing GBDT model, yet the current back-propagation optimizer struggles to efficiently discover such optimal functionality. On the other hand, during GBDT development, hard tree pruning, entropy-driven feature gate, and model ensemble have proved to be more adaptable to tabular data. By combining these key components, we present a Tree-hybrid simple MLP (T-MLP). In our framework, a tensorized, rapidly trained GBDT feature gate, a DNN architecture pruning approach, as well as a vanilla back-propagation optimizer collaboratively train a randomly initialized MLP model. Comprehensive experiments show that T-MLP is competitive with extensively tuned DNNs and GBDTs in their dominating tabular benchmarks (88 datasets) respectively, all achieved with compact model storage and significantly reduced training duration.
CLNov 28, 2023
Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical ClassificationJiahuan Yan, Haojun Gao, Zhang Kai et al.
Deep learning approaches exhibit promising performances on various text tasks. However, they are still struggling on medical text classification since samples are often extremely imbalanced and scarce. Different from existing mainstream approaches that focus on supplementary semantics with external medical information, this paper aims to rethink the data challenges in medical texts and present a novel framework-agnostic algorithm called Text2Tree that only utilizes internal label hierarchy in training deep learning models. We embed the ICD code tree structure of labels into cascade attention modules for learning hierarchy-aware label representations. Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are devised to boost text classification by reusing and distinguishing samples of other labels following the label representation hierarchy, respectively. Experiments on authoritative public datasets and real-world medical records show that our approach stably achieves superior performances over classical and advanced imbalanced classification methods.
IVJul 29, 2024
TeleOR: Real-time Telemedicine System for Full-Scene Operating RoomYixuan Wu, Kaiyuan Hu, Qian Shao et al.
The advent of telemedicine represents a transformative development in leveraging technology to extend the reach of specialized medical expertise to remote surgeries, a field where the immediacy of expert guidance is paramount. However, the intricate dynamics of Operating Room (OR) scene pose unique challenges for telemedicine, particularly in achieving high-fidelity, real-time scene reconstruction and transmission amidst obstructions and bandwidth limitations. This paper introduces TeleOR, a pioneering system designed to address these challenges through real-time OR scene reconstruction for Tele-intervention. TeleOR distinguishes itself with three innovative approaches: dynamic self-calibration, which leverages inherent scene features for calibration without the need for preset markers, allowing for obstacle avoidance and real-time camera adjustment; selective OR reconstruction, focusing on dynamically changing scene segments to reduce reconstruction complexity; and viewport-adaptive transmission, optimizing data transmission based on real-time client feedback to efficiently deliver high-quality 3D reconstructions within bandwidth constraints. Comprehensive experiments on the 4D-OR surgical scene dataset demostrate the superiority and applicability of TeleOR, illuminating the potential to revolutionize tele-interventions by overcoming the spatial and technical barriers inherent in remote surgical guidance.
CVSep 24, 2023
OneSeg: Self-learning and One-shot Learning based Single-slice Annotation for 3D Medical Image SegmentationYixuan Wu, Bo Zheng, Jintai Chen et al.
As deep learning methods continue to improve medical image segmentation performance, data annotation is still a big bottleneck due to the labor-intensive and time-consuming burden on medical experts, especially for 3D images. To significantly reduce annotation efforts while attaining competitive segmentation accuracy, we propose a self-learning and one-shot learning based framework for 3D medical image segmentation by annotating only one slice of each 3D image. Our approach takes two steps: (1) self-learning of a reconstruction network to learn semantic correspondence among 2D slices within 3D images, and (2) representative selection of single slices for one-shot manual annotation and propagating the annotated data with the well-trained reconstruction network. Extensive experiments verify that our new framework achieves comparable performance with less than 1% annotated data compared with fully supervised methods and generalizes well on several out-of-distribution testing sets.
CVMar 12, 2024Code
LKM-UNet: Large Kernel Vision Mamba UNet for Medical Image SegmentationJinhong Wang, Jintai Chen, Danny Chen et al.
In clinical practice, medical image segmentation provides useful information on the contours and dimensions of target organs or tissues, facilitating improved diagnosis, analysis, and treatment. In the past few years, convolutional neural networks (CNNs) and Transformers have dominated this area, but they still suffer from either limited receptive fields or costly long-range modeling. Mamba, a State Space Sequence Model (SSM), recently emerged as a promising paradigm for long-range dependency modeling with linear complexity. In this paper, we introduce a Large Kernel Vision Mamba U-shape Network, or LKM-UNet, for medical image segmentation. A distinguishing feature of our LKM-UNet is its utilization of large Mamba kernels, excelling in locally spatial modeling compared to small kernel-based CNNs and Transformers, while maintaining superior efficiency in global modeling compared to self-attention with quadratic complexity. Additionally, we design a novel hierarchical and bidirectional Mamba block to further enhance Mamba's global and neighborhood spatial modeling capability for vision inputs. Comprehensive experiments demonstrate the feasibility and the effectiveness of using large-size Mamba kernels to achieve large receptive fields. Codes are available at https://github.com/wjh892521292/LKM-UNet.
LGSep 22, 2024
Protein-Mamba: Biological Mamba Models for Protein Function PredictionBohao Xu, Yingzhou Lu, Yoshitaka Inoue et al.
Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein-Mamba, a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction. The pre-training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine-tuning stage refines these insights using specific labeled datasets, resulting in superior prediction performance. Our extensive experiments demonstrate that Protein-Mamba achieves competitive performance, compared with a couple of state-of-the-art methods across a range of protein function datasets. This model's ability to effectively utilize both unlabeled and labeled data highlights the potential of self-supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.
AIDec 16, 2025Code
Incentivizing Tool-augmented Thinking with Images for Medical Image AnalysisYankai Jiang, Yujie Zhang, Peng Zhang et al.
Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.
AINov 11, 2025
Versatile and Risk-Sensitive Cardiac Diagnosis via Graph-Based ECG Signal RepresentationYue Wang, Yuyang Xu, Renjun Hu et al.
Despite the rapid advancements of electrocardiogram (ECG) signal diagnosis and analysis methods through deep learning, two major hurdles still limit their clinical adoption: the lack of versatility in processing ECG signals with diverse configurations, and the inadequate detection of risk signals due to sample imbalances. Addressing these challenges, we introduce VersAtile and Risk-Sensitive cardiac diagnosis (VARS), an innovative approach that employs a graph-based representation to uniformly model heterogeneous ECG signals. VARS stands out by transforming ECG signals into versatile graph structures that capture critical diagnostic features, irrespective of signal diversity in the lead count, sampling frequency, and duration. This graph-centric formulation also enhances diagnostic sensitivity, enabling precise localization and identification of abnormal ECG patterns that often elude standard analysis methods. To facilitate representation transformation, our approach integrates denoising reconstruction with contrastive learning to preserve raw ECG information while highlighting pathognomonic patterns. We rigorously evaluate the efficacy of VARS on three distinct ECG datasets, encompassing a range of structural variations. The results demonstrate that VARS not only consistently surpasses existing state-of-the-art models across all these datasets but also exhibits substantial improvement in identifying risk signals. Additionally, VARS offers interpretability by pinpointing the exact waveforms that lead to specific model outputs, thereby assisting clinicians in making informed decisions. These findings suggest that our VARS will likely emerge as an invaluable tool for comprehensive cardiac health assessment.
LGSep 13, 2024
Quantum-inspired Reinforcement Learning for Synthesizable Drug DesignDannong Wang, Jintai Chen, Zhiding Liang et al.
Synthesizable molecular design (also known as synthesizable molecular optimization) is a fundamental problem in drug discovery, and involves designing novel molecular structures to improve their properties according to drug-relevant oracle functions (i.e., objective) while ensuring synthetic feasibility. However, existing methods are mostly based on random search. To address this issue, in this paper, we introduce a novel approach using the reinforcement learning method with quantum-inspired simulated annealing policy neural network to navigate the vast discrete space of chemical structures intelligently. Specifically, we employ a deterministic REINFORCE algorithm using policy neural networks to output transitional probability to guide state transitions and local search using genetic algorithm to refine solutions to a local optimum within each iteration. Our methods are evaluated with the Practical Molecular Optimization (PMO) benchmark framework with a 10K query budget. We further showcase the competitive performance of our method by comparing it against the state-of-the-art genetic algorithms-based method.
LGJul 18, 2024
TrialEnroll: Predicting Clinical Trial Enrollment Success with Deep & Cross Network and Large Language ModelsLing Yue, Sixue Xing, Jintai Chen et al.
Clinical trials need to recruit a sufficient number of volunteer patients to demonstrate the statistical power of the treatment (e.g., a new drug) in curing a certain disease. Clinical trial recruitment has a significant impact on trial success. Forecasting whether the recruitment process would be successful before we run the trial would save many resources and time. This paper develops a novel deep & cross network with large language model (LLM)-augmented text feature that learns semantic information from trial eligibility criteria and predicts enrollment success. The proposed method enables interpretability by understanding which sentence/word in eligibility criteria contributes heavily to prediction. We also demonstrate the empirical superiority of the proposed method (0.7002 PR-AUC) over a bunch of well-established machine learning methods. The code and curated dataset are publicly available at https://anonymous.4open.science/r/TrialEnroll-7E12.
IVJun 23, 2025Code
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and DiagnosisYuting Zhang, Kaishen Yuan, Hao Lu et al.
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
CVJun 6, 2023
PGformer: Proxy-Bridged Game Transformer for Multi-Person Highly Interactive Extreme Motion PredictionYanwen Fang, Jintai Chen, Peng-Tao Jiang et al.
Multi-person motion prediction is a challenging task, especially for real-world scenarios of highly interacted persons. Most previous works have been devoted to studying the case of weak interactions (e.g., walking together), in which typically forecasting each human pose in isolation can still achieve good performances. This paper focuses on collaborative motion prediction for multiple persons with extreme motions and attempts to explore the relationships between the highly interactive persons' pose trajectories. Specifically, a novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between the two pose sequences tailored for this situation. A proxy unit is additionally introduced to bridge the involved persons, which cooperates with our proposed XQA module and subtly controls the bidirectional spatial information flows. These designs are then integrated into a Transformer-based architecture and the resulting model is called Proxy-bridged Game Transformer (PGformer) for multi-person interactive motion prediction. Its effectiveness has been evaluated on the challenging ExPI dataset, which involves highly interactive actions. Our PGformer consistently outperforms the state-of-the-art methods in both short- and long-term predictions by a large margin. Besides, our approach can also be compatible with the weakly interacted CMU-Mocap and MuPoTS-3D datasets and extended to the case of more than 2 individuals with encouraging results.
CVMar 28, 2024Code
PoCo: A Self-Supervised Approach via Polar Transformation Based Progressive Contrastive Learning for Ophthalmic Disease DiagnosisJinhong Wang, Tingting Chen, Jintai Chen et al.
Automatic ophthalmic disease diagnosis on fundus images is important in clinical practice. However, due to complex fundus textures and limited annotated data, developing an effective automatic method for this problem is still challenging. In this paper, we present a self-supervised method via polar transformation based progressive contrastive learning, called PoCo, for ophthalmic disease diagnosis. Specifically, we novelly inject the polar transformation into contrastive learning to 1) promote contrastive learning pre-training to be faster and more stable and 2) naturally capture task-free and rotation-related textures, which provides insights into disease recognition on fundus images. Beneficially, simple normal translation-invariant convolution on transformed images can equivalently replace the complex rotation-invariant and sector convolution on raw images. After that, we develop a progressive contrastive learning method to efficiently utilize large unannotated images and a novel progressive hard negative sampling scheme to gradually reduce the negative sample number for efficient training and performance enhancement. Extensive experiments on three public ophthalmic disease datasets show that our PoCo achieves state-of-the-art performance with good generalization ability, validating that our method can reduce annotation efforts and provide reliable diagnosis. Codes are available at \url{https://github.com/wjh892521292/PoCo}.
AIJan 1
ClinicalReTrial: A Self-Evolving AI Agent for Clinical Trial Protocol OptimizationSixue Xing, Xuanye Xia, Kerui Wu et al.
Clinical trial failure remains a central bottleneck in drug development, where minor protocol design flaws can irreversibly compromise outcomes despite promising therapeutics. Although cutting-edge AI methods achieve strong performance in predicting trial success, they are inherently reactive for merely diagnosing risk without offering actionable remedies once failure is anticipated. To fill this gap, this paper proposes ClinicalReTrial, a self-evolving AI agent framework that addresses this gap by casting clinical trial reasoning as an iterative protocol redesign problem. Our method integrates failure diagnosis, safety-aware modification, and candidate evaluation in a closed-loop, reward-driven optimization framework. Serving the outcome prediction model as a simulation environment, ClinicalReTrial enables low-cost evaluation of protocol modifications and provides dense reward signals for continuous self-improvement. To support efficient exploration, the framework maintains hierarchical memory that captures iteration-level feedback within trials and distills transferable redesign patterns across trials. Empirically, ClinicalReTrial improves 83.3% of trial protocols with a mean success probability gain of 5.7%, and retrospective case studies demonstrate strong alignment between the discovered redesign strategies and real-world clinical trial modifications.
CVNov 24, 2025Code
MedSAM3: Delving into Segment Anything with Medical ConceptsAnglin Liu, Rundong Xue, Xu R. Cao et al.
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
CVJun 26, 2025Code
Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language ModelsJiahe Chen, Jiaying He, Qiyuan Chen et al.
Large Vision-Language Models (LVLMs) face a tug-of-war between powerful linguistic priors and visual evidence, often leading to ``semantic drift'' -- the progressive detachment from visual input that we identify as the root cause of hallucination. While several existing training-free decoding strategies have achieved considerable success, they still suffer from inherent limitations. Many are computationally prohibitive, requiring multiple forward passes through the entire LVLM, while others rely on indirect, heuristic-based proxies that are unreliable correlates for a direct semantic conflict. We propose \textbf{D}ynamic \textbf{L}ogits \textbf{C}alibration (DLC), a novel training-free framework that is the first to cure semantic drift in a direct, dynamic, and efficient manner. At each decoding step, DLC introduces a real-time visual referee that performs a dual-aspect visual alignment check: (1) it assesses the intrinsic visual relevance of a candidate token and (2) its contextual visual coherence. By dynamically balancing these two checks and evaluating them against an adaptive baseline, DLC surgically modulates the output logits to favor grounded tokens. Extensive experiments show DLC significantly outperforms existing methods in mitigating hallucinations while, crucially, maintaining high inference efficiency by avoiding costly multiple LVLM forward passes. Our work presents a powerful and practical solution for building more reliable and visually-grounded LVLMs. Code will be released on https://github.com/JiaheChen2002/DLC.
CVJun 14, 2024Code
What is the Visual Cognition Gap between Humans and Multimodal LLMs?Xu Cao, Yifan Shen, Bolin Lai et al.
Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well-established. One such challenge is matrix reasoning - the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children. Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC), we propose a new dataset MaRs-VQA to evaluate the visual cognition capability of MLLMs and compare their performance with existing human visual cognition studies. Based on the training data of MaRs-VQA, we also finetune a baseline model Qwen2-VCog with multi-stage cognition reasoning annotations. Our comparative experiments with different baselines reveal a gap between MLLMs and human intelligence, highlighting the visual cognitive limitations of current MLLMs. We believe that the public release of MaRs-VQA and the Qwen2-VCog baseline model will drive progress toward the next generation of MLLMs with human-like visual cognition abilities. MaRs-VQA is available at huggingface.co/datasets/IrohXu/VCog-Bench. The training code of Qwen2-VCog is available at github.com/IrohXu/Cognition-MLLM.
LGDec 6, 2021Code
DANets: Deep Abstract Networks for Tabular Data Classification and RegressionJintai Chen, Kuanlun Liao, Yao Wan et al.
Tabular data are ubiquitous in real world applications. Although many commonly-used neural components (e.g., convolution) and extensible neural networks (e.g., ResNet) have been developed by the machine learning community, few of them were effective for tabular data and few designs were adequately tailored for tabular data structures. In this paper, we propose a novel and flexible neural component for tabular data, called Abstract Layer (AbstLay), which learns to explicitly group correlative input features and generate higher-level features for semantics abstraction. Also, we design a structure re-parameterization method to compress the learned AbstLay, thus reducing the computational complexity by a clear margin in the reference phase. A special basic block is built using AbstLays, and we construct a family of Deep Abstract Networks (DANets) for tabular data classification and regression by stacking such blocks. In DANets, a special shortcut path is introduced to fetch information from raw tabular features, assisting feature interactions across different levels. Comprehensive experiments on seven real-world tabular datasets show that our AbstLay and DANets are effective for tabular data classification and regression, and the computational complexity is superior to competitive methods. Besides, we evaluate the performance gains of DANet as it goes deep, verifying the extendibility of our method. Our code is available at https://github.com/WhatAShot/DANet.
SPNov 4, 2025
NEF-NET+: Adapting Electrocardio panorama in the wildZehui Zhan, Yaojun Hu, Jiajing Zhan et al.
Conventional multi-lead electrocardiogram (ECG) systems capture cardiac signals from a fixed set of anatomical viewpoints defined by lead placement. However, certain cardiac conditions (e.g., Brugada syndrome) require additional, non-standard viewpoints to reveal diagnostically critical patterns that may be absent in standard leads. To systematically overcome this limitation, Nef-Net was recently introduced to reconstruct a continuous electrocardiac field, enabling virtual observation of ECG signals from arbitrary views (termed Electrocardio Panorama). Despite its promise, Nef-Net operates under idealized assumptions and faces in-the-wild challenges, such as long-duration ECG modeling, robustness to device-specific signal artifacts, and suboptimal lead placement calibration. This paper presents NEF-NET+, an enhanced framework for realistic panoramic ECG synthesis that supports arbitrary-length signal synthesis from any desired view, generalizes across ECG devices, and compensates for operator-induced deviations in electrode placement. These capabilities are enabled by a newly designed model architecture that performs direct view transformation, incorporating a workflow comprising offline pretraining, device calibration tuning steps as well as an on-the-fly calibration step for patient-specific adaptation. To rigorously evaluate panoramic ECG synthesis, we construct a new Electrocardio Panorama benchmark, called Panobench, comprising 5367 recordings with 48-view per subject, capturing the full spatial variability of cardiac electrical activity. Experimental results show that NEF-NET+ delivers substantial improvements over Nef-Net, yielding an increase of around 6 dB in PSNR in real-world setting. The code and Panobench will be released in a subsequent publication.
CLMar 4, 2024
Making Pre-trained Language Models Great on Tabular PredictionJiahuan Yan, Bo Zheng, Hongxia Xu et al.
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, such DNN bonus is still far from being well exploited on tabular data prediction (e.g., regression or classification tasks). Condensing knowledge from diverse domains, language models (LMs) possess the capability to comprehend feature names from various tables, potentially serving as versatile learners in transferring knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, a specifically pre-trained LM for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads the performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in typical tabular data regime.
LGJan 7, 2024
Uncertainty Quantification on Clinical Trial Outcome PredictionTianyi Chen, Yingzhou Lu, Nan Hao et al.
The importance of uncertainty quantification is increasingly recognized in the diverse field of machine learning. Accurately assessing model prediction uncertainty can help provide deeper understanding and confidence for researchers and practitioners. This is especially critical in medical diagnosis and drug discovery areas, where reliable predictions directly impact research quality and patient health. In this paper, we proposed incorporating uncertainty quantification into clinical trial outcome predictions. Our main goal is to enhance the model's ability to discern nuanced differences, thereby significantly improving its overall performance. We have adopted a selective classification approach to fulfill our objective, integrating it seamlessly with the Hierarchical Interaction Network (HINT), which is at the forefront of clinical trial prediction modeling. Selective classification, encompassing a spectrum of methods for uncertainty quantification, empowers the model to withhold decision-making in the face of samples marked by ambiguity or low confidence, thereby amplifying the accuracy of predictions for the instances it chooses to classify. A series of comprehensive experiments demonstrate that incorporating selective classification into clinical trial predictions markedly enhances the model's performance, as evidenced by significant upticks in pivotal metrics such as PR-AUC, F1, ROC-AUC, and overall accuracy. Specifically, the proposed method achieved 32.37\%, 21.43\%, and 13.27\% relative improvement on PR-AUC over the base model (HINT) in phase I, II, and III trial outcome prediction, respectively. When predicting phase III, our method reaches 0.9022 PR-AUC scores. These findings illustrate the robustness and prospective utility of this strategy within the area of clinical trial predictions, potentially setting a new benchmark in the field.
CLApr 23, 2024
ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based ReasoningLing Yue, Sixue Xing, Jintai Chen et al.
Large Language Models (LLMs) and multi-agent systems have shown impressive capabilities in natural language tasks but face challenges in clinical trial applications, primarily due to limited access to external knowledge. Recognizing the potential of advanced clinical trial tools that aggregate and predict based on the latest medical data, we propose an integrated solution to enhance their accessibility and utility. We introduce Clinical Agent System (ClinicalAgent), a clinical multi-agent system designed for clinical trial tasks, leveraging GPT-4, multi-agent architectures, LEAST-TO-MOST, and ReAct reasoning technology. This integration not only boosts LLM performance in clinical contexts but also introduces novel functionalities. The proposed method achieves competitive predictive performance in clinical trial outcome prediction (0.7908 PR-AUC), obtaining a 0.3326 improvement over the standard prompt Method. Publicly available code can be found at https://anonymous.4open.science/r/ClinicalAgent-6671.
AIJan 13, 2025
From Screens to Scenes: A Survey of Embodied AI in HealthcareYihao Liu, Xu Cao, Tingting Chen et al.
Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and personalization. Powered by modern AI technologies such as multimodal large language models and world models, Embodied AI (EmAI) represents a transformative frontier, offering enhanced autonomy and the ability to interact with the physical world to address these challenges. As an interdisciplinary and rapidly evolving research domain, "EmAI in healthcare" spans diverse fields such as algorithms, robotics, and biomedicine. This complexity underscores the importance of timely reviews and analyses to track advancements, address challenges, and foster cross-disciplinary collaboration. In this paper, we provide a comprehensive overview of the "brain" of EmAI for healthcare, wherein we introduce foundational AI algorithms for perception, actuation, planning, and memory, and focus on presenting the healthcare applications spanning clinical interventions, daily care & companionship, infrastructure support, and biomedical research. Despite its promise, the development of EmAI for healthcare is hindered by critical challenges such as safety concerns, gaps between simulation platforms and real-world applications, the absence of standardized benchmarks, and uneven progress across interdisciplinary domains. We discuss the technical barriers and explore ethical considerations, offering a forward-looking perspective on the future of EmAI in healthcare. A hierarchical framework of intelligent levels for EmAI systems is also introduced to guide further development. By providing systematic insights, this work aims to inspire innovation and practical applications, paving the way for a new era of intelligent, patient-centered healthcare.
CVNov 18, 2024
Medical Video Generation for Disease Progression SimulationXu Cao, Kaizhao Liang, Kuei-Da Liao et al.
Modeling disease progression is crucial for improving the quality and efficacy of clinical diagnosis and prognosis, but it is often hindered by a lack of longitudinal medical image monitoring for individual patients. To address this challenge, we propose the first Medical Video Generation (MVG) framework that enables controlled manipulation of disease-related image and video features, allowing precise, realistic, and personalized simulations of disease progression. Our approach begins by leveraging large language models (LLMs) to recaption prompt for disease trajectory. Next, a controllable multi-round diffusion model simulates the disease progression state for each patient, creating realistic intermediate disease state sequence. Finally, a diffusion-based video transition generation model interpolates disease progression between these states. We validate our framework across three medical imaging domains: chest X-ray, fundus photography, and skin image. Our results demonstrate that MVG significantly outperforms baseline models in generating coherent and clinically plausible disease trajectories. Two user studies by veteran physicians, provide further validation and insights into the clinical utility of the generated sequences. MVG has the potential to assist healthcare providers in modeling disease trajectories, interpolating missing medical image data, and enhancing medical education through realistic, dynamic visualizations of disease progression.
LGApr 20, 2024
TrialDura: Hierarchical Attention Transformer for Interpretable Clinical Trial Duration PredictionLing Yue, Jonathan Li, Sixue Xing et al.
The clinical trial process, a critical phase in drug development, is essential for developing new treatments. The primary goal of interventional clinical trials is to evaluate the safety and efficacy of drug-based treatments for specific diseases. However, these trials are often lengthy, labor-intensive, and expensive. The duration of a clinical trial significantly impacts overall costs, making efficient timeline management crucial for controlling budgets and ensuring the economic feasibility of research. To address this issue, We propose TrialDura, a machine learning-based method that estimates the duration of clinical trials using multimodal data, including disease names, drug molecules, trial phases, and eligibility criteria. Then, we encode them into Bio-BERT embeddings specifically tuned for biomedical contexts to provide a deeper and more relevant semantic understanding of clinical trial data. Finally, the model's hierarchical attention mechanism connects all of the embeddings to capture their interactions and predict clinical trial duration. Our proposed model demonstrated superior performance with a mean absolute error (MAE) of 1.04 years and a root mean square error (RMSE) of 1.39 years compared to the other models, indicating more accurate clinical trial duration prediction. Publicly available code can be found at: https://anonymous.4open.science/r/TrialDura-F196.
LGApr 17, 2024
Personalized Heart Disease Detection via ECG Digital Twin GenerationYaojun Hu, Jintai Chen, Lianting Hu et al.
Heart diseases rank among the leading causes of global mortality, demonstrating a crucial need for early diagnosis and intervention. Most traditional electrocardiogram (ECG) based automated diagnosis methods are trained at population level, neglecting the customization of personalized ECGs to enhance individual healthcare management. A potential solution to address this limitation is to employ digital twins to simulate symptoms of diseases in real patients. In this paper, we present an innovative prospective learning approach for personalized heart disease detection, which generates digital twins of healthy individuals' anomalous ECGs and enhances the model sensitivity to the personalized symptoms. In our approach, a vector quantized feature separator is proposed to locate and isolate the disease symptom and normal segments in ECG signals with ECG report guidance. Thus, the ECG digital twins can simulate specific heart diseases used to train a personalized heart disease detection model. Experiments demonstrate that our approach not only excels in generating high-fidelity ECG signals but also improves personalized heart disease detection. Moreover, our approach ensures robust privacy protection, safeguarding patient data in model development.
LGDec 28, 2024
ProtCLIP: Function-Informed Protein Multi-Modal LearningHanjing Zhou, Mingze Yin, Wei Wu et al.
Multi-modality pre-training paradigm that aligns protein sequences and biological descriptions has learned general protein representations and achieved promising performance in various downstream applications. However, these works were still unable to replicate the extraordinary success of language-supervised visual foundation models due to the ineffective usage of aligned protein-text paired data and the lack of an effective function-informed pre-training paradigm. To address these issues, this paper curates a large-scale protein-text paired dataset called ProtAnno with a property-driven sampling strategy, and introduces a novel function-informed protein pre-training paradigm. Specifically, the sampling strategy determines selecting probability based on the sample confidence and property coverage, balancing the data quality and data quantity in face of large-scale noisy data. Furthermore, motivated by significance of the protein specific functional mechanism, the proposed paradigm explicitly model protein static and dynamic functional segments by two segment-wise pre-training objectives, injecting fine-grained information in a function-informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi-modality foundation model that comprehensively represents function-aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.
CVNov 18, 2024
Scalable Autoregressive Monocular Depth EstimationJinhong Wang, Jian Liu, Dongqi Tang et al.
This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: We tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats the depth map of different resolutions as a set of tokens, and conducts the low-to-high resolution autoregressive objective with a patch-wise casual mask. Second, our DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, our DAR establishes new state-of-the-art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B and achieve the best RMSE of 1.799 on the KITTI dataset (5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
LGMar 3, 2025
Foundation Model in BiomedicineXiangrui Liu, Yuanyuan Zhang, Qianyu Shang et al.
Foundation models, first introduced in 2021, refer to large-scale pretrained models (e.g., large language models (LLMs) and vision-language models (VLMs)) that learn from extensive unlabeled datasets through unsupervised methods, enabling them to excel in diverse downstream tasks. These models, like GPT, can be adapted to various applications such as question answering and visual understanding, outperforming task-specific AI models and earning their name due to broad applicability across fields. The development of biomedical foundation models marks a significant milestone in the use of artificial intelligence (AI) to understand complex biological phenomena and advance medical research and practice. This survey explores the potential of foundation models in diverse domains within biomedical fields, including computational biology, drug discovery and development, clinical informatics, medical imaging, and public health. The purpose of this survey is to inspire ongoing research in the application of foundation models to health science.
SPNov 17, 2024
AnyECG: Foundational Models for Multitask Cardiac Analysis in Real-World SettingsYue Wang, Xu Cao, Yaojun Hu et al.
Electrocardiogram (ECG), a non-invasive and affordable tool for cardiac monitoring, is highly sensitive in detecting acute heart attacks. However, due to the lengthy nature of ECG recordings, numerous machine learning methods have been developed for automated heart disease detection to reduce human workload. Despite these efforts, performance remains suboptimal. A key obstacle is the inherent complexity of ECG data, which includes heterogeneity (e.g., varying sampling rates), high levels of noise, demographic-related pattern shifts, and intricate rhythm-event associations. To overcome these challenges, this paper introduces AnyECG, a foundational model designed to extract robust representations from any real-world ECG data. Specifically, a tailored ECG Tokenizer encodes each fixed-duration ECG fragment into a token and, guided by proxy tasks, converts noisy, continuous ECG features into discrete, compact, and clinically meaningful local rhythm codes. These codes encapsulate basic morphological, frequency, and demographic information (e.g., sex), effectively mitigating signal noise. We further pre-train the AnyECG to learn rhythmic pattern associations across ECG tokens, enabling the capture of cardiac event semantics. By being jointly pre-trained on diverse ECG data sources, AnyECG is capable of generalizing across a wide range of downstream tasks where ECG signals are recorded from various devices and scenarios. The experimental results show that AnyECG achieves an average performance improvement of 6% across four critical tasks-anomaly detection, arrhythmia classification, corrupted lead generation, and ultra-long ECG recognition. AnyECG learns common ECG rhythm from data and significantly outperforms state-of-the-art methods in each of these tasks.
CVApr 11, 2024
Multi-rater Prompting for Ambiguous Medical Image SegmentationJinhong Wang, Yi Cheng, Jintai Chen et al.
Multi-rater annotations commonly occur when medical images are independently annotated by multiple experts (raters). In this paper, we tackle two challenges arisen in multi-rater annotations for medical image segmentation (called ambiguous medical image segmentation): (1) How to train a deep learning model when a group of raters produces a set of diverse but plausible annotations, and (2) how to fine-tune the model efficiently when computation resources are not available for re-training the entire model on a different dataset domain. We propose a multi-rater prompt-based approach to address these two challenges altogether. Specifically, we introduce a series of rater-aware prompts that can be plugged into the U-Net model for uncertainty estimation to handle multi-annotation cases. During the prompt-based fine-tuning process, only 0.3% of learnable parameters are required to be updated comparing to training the entire model. Further, in order to integrate expert consensus and disagreement, we explore different multi-rater incorporation strategies and design a mix-training strategy for comprehensive insight learning. Extensive experiments verify the effectiveness of our new approach for ambiguous medical image segmentation on two public datasets while alleviating the heavy burden of model re-training.
CLMar 3, 2024
Small Models are LLM Knowledge Triggers on Medical Tabular PredictionJiahuan Yan, Jintai Chen, Chaowen Hu et al.
Recent development in large language models (LLMs) has demonstrated impressive domain proficiency on unstructured textual or multi-modal tasks. However, despite with intrinsic world knowledge, their application on structured tabular data prediction still lags behind, primarily due to the numerical insensitivity and modality discrepancy that brings a gap between LLM reasoning and statistical tabular learning. Unlike textual or vision data (e.g., electronic clinical notes or medical imaging data), tabular data is often presented in heterogeneous numerical values (e.g., CBC reports). This ubiquitous data format requires intensive expert annotation, and its numerical nature limits LLMs' capability to effectively transfer untapped domain expertise. In this paper, we propose SERSAL, a general self-prompting method by synergy learning with small models to enhance LLM tabular prediction in an unsupervised manner. Specifically, SERSAL utilizes the LLM's prior outcomes as original soft noisy annotations, which are dynamically leveraged to teach a better small student model. Reversely, the outcomes from the trained small model are used to teach the LLM to further refine its real capability. This process can be repeatedly applied to gradually distill refined knowledge for continuous progress. Comprehensive experiments on widely used medical domain tabular datasets show that, without access to gold labels, applying SERSAL to OpenAI GPT reasoning process attains substantial improvement compared to linguistic prompting methods, which serves as an orthogonal direction for tabular LLM, and increasing prompting bonus is observed as more powerful LLMs appear.