RO | Aug 16, 2023
Autoencoding a Soft Touch to Learn Grasping from On-land to Underwater
Ning Guo, Xudong Han, Xiaobo Liu et al.
Robots play a critical role as the physical agents of human operators in exploring the ocean. However, it remains challenging to grasp objects reliably while fully submerged in a highly pressurized aquatic environment with little visible light, mainly because fluidic interference affects the tactile mechanics between the finger and object surfaces. This study investigates the transferability of grasping knowledge from on-land to underwater via a vision-based soft robotic finger that learns 6D forces and torques (FT) using a Supervised Variational Autoencoder (SVAE). A high-framerate camera captures the whole-body deformations while the soft robotic finger interacts with physical objects on-land and underwater. Results show that the trained SVAE model learned a series of latent representations of the soft mechanics that transfer from land to water, adapting to the changing environments better than commercial FT sensors. Soft, delicate, and reactive grasping enabled by tactile intelligence improves the reliability and robustness of the gripper's underwater interaction at a much-reduced cost, paving the way for learning-based intelligent grasping to support fundamental scientific discoveries in environmental and ocean research.
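The SVAE objective can be sketched as a standard VAE loss plus a supervised force/torque regression term. This minimal NumPy sketch rests on assumptions the abstract does not state: a diagonal Gaussian encoder, squared-error reconstruction, a 6-D FT head trained with squared error, and hypothetical weights `beta` and `gamma`.

```python
import numpy as np

def svae_loss(x, x_recon, mu, logvar, ft_true, ft_pred, beta=1.0, gamma=1.0):
    """Reconstruction + beta * KL + gamma * supervised FT regression (weights are hypothetical).

    x, x_recon: (B, D) flattened camera images; mu, logvar: (B, Z) encoder outputs;
    ft_true, ft_pred: (B, 6) measured and predicted forces/torques.
    """
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    sup = np.mean(np.sum((ft_true - ft_pred) ** 2, axis=1))
    return recon + beta * kl + gamma * sup

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps, the reparameterization trick used to sample the latent code."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

In a full model, `ft_pred` would come from a regression head reading the latent code, which is what lets the latent space carry force information across environments.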
CV | May 23, 2025 | Code
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Shuang Zeng, Xinyuan Chang, Mengwei Xie et al.
Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
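Expanding a language model's vocabulary with visual tokens typically means appending embedding rows for an image tokenizer's codes and offsetting their ids past the text vocabulary. This sketch assumes a discrete image tokenizer whose codes are integers `0..K-1`; the table sizes, initialization scale, and helper names are illustrative, not FSDrive's.

```python
import numpy as np

def expand_embeddings(emb, num_visual_codes, rng, scale=0.02):
    """Append one embedding row per visual code; returns (new_table, visual_offset)."""
    new_rows = rng.normal(0.0, scale, size=(num_visual_codes, emb.shape[1]))
    return np.vstack([emb, new_rows]), emb.shape[0]

def to_token_ids(text_ids, visual_codes, visual_offset):
    """One sequence: text prompt ids followed by shifted visual-code ids."""
    return list(text_ids) + [visual_offset + int(c) for c in visual_codes]

def split_ids(ids, visual_offset):
    """Recover text ids and raw visual codes from a mixed sequence."""
    text = [i for i in ids if i < visual_offset]
    visual = [i - visual_offset for i in ids if i >= visual_offset]
    return text, visual
```

Because text and visual ids share one table, the same autoregressive head can emit either kind of token, which is what allows a single model to both answer questions and predict future frames.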
IV | Jan 3, 2023
PMT-IQA: Progressive Multi-task Learning for Blind Image Quality Assessment
Qingyi Pan, Ning Guo, Letu Qingge et al.
Blind image quality assessment (BIQA) remains challenging because the diversity of distortions and variations in image content complicate distortion patterns across scales and make the BIQA regression problem harder. Existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has examined learning strategies that improve the regression model's performance. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT). These modules help the model learn complex distortion patterns and better optimize the regression, following the easy-to-hard progression of human learning. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets; the results indicate that PMT-IQA outperforms the comparison approaches and that both the MS and PMT modules improve the model's performance.
GR | May 1
Efficient B-Spline Finite Elements for Cloth Simulation
Yuqi Meng, Yihao Shi, Kemeng Huang et al.
We present an efficient B-spline finite element method (FEM) for cloth simulation. While higher-order FEM has long promised higher accuracy, its adoption in cloth simulators has been limited because it costs more to compute while producing results of similar visual quality. Our contribution is a full algorithmic pipeline that makes cloth simulation using quadratic B-spline surfaces faster than standard linear FEM in practice while consistently improving accuracy and visual fidelity. Using quadratic B-spline basis functions, we obtain a globally C¹-continuous displacement field that supports consistent discretization of both membrane and bending energies, effectively reducing the locking artifacts and mesh dependence common to linear elements. To close the performance gap, we introduce a reduced integration scheme that separately optimizes quadrature rules for membrane and bending energies, an accelerated Hessian assembly procedure tailored to the spline structure, and an optimized linear solver based on partial factorization. Together, these optimizations make high-order, smooth cloth simulation competitive at scale, yielding an average 2× speedup over linear FEM in our tests. Extensive experiments demonstrate improved accuracy, wrinkle detail, and robustness, including in contact-rich scenarios, relative to linear FEM and recent higher-order approaches. Our method enables realistic wrinkling dynamics across a wide range of material parameters and supports practical garment animation, providing a promising new spatial discretization for high-quality cloth simulation.
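The C¹ continuity claimed above comes from the quadratic B-spline basis itself. This sketch evaluates the standard uniform quadratic basis on one knot span and a tensor-product surface point; it assumes uniform knots (boundary handling with open knot vectors, which a real cloth mesh needs, is omitted), and the function names are illustrative.

```python
import numpy as np

def quad_bspline_basis(t):
    """Uniform quadratic B-spline basis on one knot span, t in [0, 1].
    Returns the 3 basis values and their derivatives with respect to t."""
    t = np.asarray(t, dtype=float)
    N = np.stack([0.5 * (1 - t) ** 2, 0.5 * (-2 * t ** 2 + 2 * t + 1), 0.5 * t ** 2])
    dN = np.stack([-(1 - t), 1 - 2 * t, t])
    return N, dN

def surface_point(ctrl, i, j, u, v):
    """Evaluate a tensor-product quadratic B-spline surface on span (i, j).
    ctrl: (nu, nv, 3) control net; the span uses the 3x3 block ctrl[i:i+3, j:j+3]."""
    Nu, _ = quad_bspline_basis(u)
    Nv, _ = quad_bspline_basis(v)
    block = ctrl[i:i + 3, j:j + 3]
    return np.einsum("a,b,abk->k", Nu, Nv, block)
```

The basis sums to one everywhere, and at a shared knot the value and first derivative of adjacent spans agree (for example, both spans give 0.5·P₁ + 0.5·P₂ and slope P₂ − P₁), which is the C¹ property that lets bending energy be discretized directly on the displacement field.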
AI | Dec 31, 2025
AMAP Agentic Planning Technical Report
AMAP AI Agent Team, Yulan Hu, Xiangwen Zhang et al.
We present STAgent, an agentic large language model tailored for spatio-temporal understanding and designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries by retaining less than 1% of the raw data and emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and a final RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, demonstrating the effectiveness of our proposed agentic model.
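The cascaded recipe routes queries by how certain the seed SFT model is about them. A minimal sketch, assuming certainty is approximated by the seed model's empirical pass rate over several rollouts; the `Query` record, the threshold, and the function name are hypothetical, since the report does not specify how certainty is computed.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    passes: int    # rollouts of the seed SFT model judged correct
    rollouts: int  # rollouts sampled for this query

def route_by_certainty(queries, threshold=0.5):
    """High-certainty queries go to the second SFT stage; the rest go to RL.
    Certainty = empirical pass rate of the seed model (an assumption)."""
    sft, rl = [], []
    for q in queries:
        certainty = q.passes / q.rollouts
        (sft if certainty >= threshold else rl).append(q)
    return sft, rl
```

The design intuition is that confidently solved queries are cheap, reliable imitation targets, while uncertain ones leave room for RL to improve through exploration.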
LG | Dec 27, 2023 | Code
Adaptive trajectory-constrained exploration strategy for deep reinforcement learning
Guojian Wang, Faguo Wu, Xiao Zhang et al.
Deep reinforcement learning (DRL) faces significant challenges on hard-exploration problems in tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods relied on complex architectures to estimate state novelty or introduced sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the agent's policy away from suboptimal solutions by leveraging incomplete offline demonstrations as references. This approach gradually expands the agent's exploration scope and strives for optimality in a constrained optimization manner. Additionally, we introduce a novel policy-gradient-based optimization algorithm that utilizes adaptively clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a derivation of the worst-case approximation error bounds, highlighting the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid world mazes and several MuJoCo tasks. The extensive experimental results demonstrate the significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings. Notably, the specific metrics and quantifiable results further support these findings. The code used in the study is available at https://github.com/buaawgj/TACE.
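A trajectory-distance reward with an adaptive clip can be sketched as follows. The distance measure (mean nearest-state Euclidean distance to the closest demonstration) and the exponential-moving-average cap are assumptions for illustration; the paper's exact distance and clipping rule are not given in the abstract.

```python
import math

def traj_distance(agent, demo):
    """Mean over agent states of the Euclidean distance to the nearest demo state."""
    return sum(min(math.dist(s, d) for d in demo) for s in agent) / len(agent)

class AdaptiveClip:
    """Hypothetical adaptive clipping: the cap tracks an EMA of observed distances."""

    def __init__(self, init_cap=1.0, momentum=0.9):
        self.cap = init_cap
        self.momentum = momentum

    def reward(self, agent, demos):
        # Distance to the closest (possibly suboptimal) demonstration
        d = min(traj_distance(agent, demo) for demo in demos)
        r = min(d, self.cap)  # clipping keeps the bonus from dominating the task reward
        self.cap = self.momentum * self.cap + (1 - self.momentum) * d
        return r
```

Clipping bounds how much the agent is paid for simply moving far from the references, so the bonus steers exploration without overwhelming the environment reward.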
CV | Sep 26, 2025 | Code
UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
Yujian Yuan, Changjie Wu, Xinyuan Chang et al.
Large-scale map construction plays a vital role in applications like autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as discrete sequences and establishing an iterative strategy to generate more complete and smooth map vectors than traditional perception-based methods; (2) proposing a flexible architecture that supports multi-modal inputs, enabling dynamic selection among BEV, PV, and text prompts, to overcome the drawbacks of satellite data; and (3) developing a state update strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.
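Representing lane lines as a discrete sequence usually means quantizing polyline coordinates into bins and adding separator tokens. This sketch assumes a square tile of side `extent` meters, `bins` shared coordinate bins, and two special tokens (lane separator, end of sequence); all of these values and names are hypothetical, and the iterative generation itself is not shown.

```python
def quantize(v, extent, bins):
    """Map a coordinate in [0, extent) to a bin index, clamped to the valid range."""
    return max(0, min(bins - 1, int(v / extent * bins)))

def encode_lanes(lanes, extent=100.0, bins=200):
    """Flatten polylines to tokens: x-bin, y-bin per point, SEP after each lane, EOS at the end."""
    sep, eos = bins, bins + 1
    tokens = []
    for lane in lanes:
        for x, y in lane:
            tokens += [quantize(x, extent, bins), quantize(y, extent, bins)]
        tokens.append(sep)
    tokens.append(eos)
    return tokens

def decode_lanes(tokens, extent=100.0, bins=200):
    """Invert encode_lanes, placing each point at its bin center."""
    sep, eos = bins, bins + 1
    lanes, cur, pending = [], [], []
    for t in tokens:
        if t == eos:
            break
        if t == sep:
            lanes.append(cur)
            cur = []
            continue
        pending.append((t + 0.5) * extent / bins)
        if len(pending) == 2:
            cur.append(tuple(pending))
            pending = []
    return lanes
```

Decoding places each point at its bin center, so the round-trip error per coordinate is at most half a bin (0.25 m with these settings), and the separator tokens let one sequence carry many lanes.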
CV | Mar 4, 2025
Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging
Yujin Oh, Robert Seifert, Yihan Cao et al.
In oncology, Positron Emission Tomography-Computed Tomography (PET/CT) is widely used in cancer diagnosis, staging, and treatment monitoring, as it combines anatomical details from CT with functional metabolic activity and molecular marker expression information from PET. However, existing artificial intelligence-driven PET/CT analyses rely predominantly on task-specific models trained from scratch or on limited datasets, limiting their generalizability and robustness. To address this, we propose a foundation model approach specifically designed for multimodal PET/CT imaging. We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional or molecular information. FratMAE employs separate Vision Transformer (ViT) encoders for PET and CT scans, along with cross-attention decoders that enable synergistic interactions between modalities during masked autoencoder training. Additionally, it incorporates textual metadata to enhance PET representation learning. By pre-training on PET/CT datasets, FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks and demonstrating its potential as a generalizable foundation model.
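The two mechanisms named above, MAE-style random masking and cross-attention between modality streams, can be sketched in NumPy. This is a single-head illustration assuming PET tokens query CT tokens; head count, projection sizes, and the 75% mask ratio are assumptions, not FratMAE's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: queries from one modality, keys/values from the other."""
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V, attn

def random_mask(num_tokens, mask_ratio, rng):
    """MAE-style random masking: sorted indices of the tokens left visible to the encoder."""
    keep = int(round(num_tokens * (1 - mask_ratio)))
    return np.sort(rng.permutation(num_tokens)[:keep])
```

Each PET token's output is a weighted mix of CT values, so the decoder can draw on anatomical context when reconstructing masked functional patches.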
RO | Nov 19, 2025
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
Jiashu Yang, Yifan Han, Yucheng Xie et al.
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently executes instructed tasks in real-world environments and actively acquires more accurate visual information through instruction-driven rotation and zoom actions, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
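Discretizing continuous camera actions into tokens can be sketched with uniform bins, one token range per action dimension. The pan/tilt/zoom ranges, the 256 bins, and the function names below are hypothetical; the abstract only states that rotation and zoom are discretized into action tokens.

```python
RANGES = {"pan": (-180.0, 180.0), "tilt": (-90.0, 90.0), "zoom": (1.0, 10.0)}  # hypothetical
BINS = 256

def action_to_tokens(action, ranges=RANGES, bins=BINS):
    """Clip each dimension to its range, bin it, and give each dimension its own id block."""
    tokens = []
    for k, name in enumerate(ranges):
        lo, hi = ranges[name]
        v = min(max(float(action[name]), lo), hi)
        b = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        tokens.append(k * bins + b)
    return tokens

def tokens_to_action(tokens, ranges=RANGES, bins=BINS):
    """Decode each token to the center of its bin."""
    action = {}
    for k, (name, tok) in enumerate(zip(ranges, tokens)):
        lo, hi = ranges[name]
        action[name] = lo + (tok - k * bins + 0.5) * (hi - lo) / bins
    return action
```

Separate id blocks per dimension keep the tokens unambiguous inside a single autoregressive sequence; the price is a quantization error of at most half a bin per dimension.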
RO | Nov 26, 2025
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
Ziyi Chen, Yingnan Guo, Zedong Chu et al.
Embodied navigation that adheres to social norms remains an open research challenge. SocialNav is a foundation model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and an understanding of social norms into the model via imitation learning, and then refine these skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. Compared to the state-of-the-art method, SocialNav achieves +38% in success rate and +46% in social compliance rate, showing strong improvements in both navigation performance and social compliance. Project page: https://amap-eai.github.io/SocialNav/
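GRPO-style methods replace a learned critic with group-relative advantages: each rollout's reward is standardized against the other rollouts sampled for the same prompt. This sketch shows only that step plus a hypothetical reward that penalizes entering a person's personal space; the comfort distance and weighting are assumptions, and the flow-based policy of SAFE-GRPO is not shown.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (group-relative advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def social_reward(reached_goal, min_person_dist, comfort_dist=1.0, w_social=0.5):
    """Hypothetical reward: goal success minus a penalty for intruding on personal space."""
    success = 1.0 if reached_goal else 0.0
    intrusion = max(0.0, comfort_dist - min_person_dist) / comfort_dist
    return success - w_social * intrusion
```

A rollout that reaches the goal without crowding anyone gets a positive advantage relative to its siblings, so compliance is rewarded even when all rollouts succeed.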
CV | May 8, 2025
OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging
Sifan Song, Siyeop Yoon, Pengfei Jin et al.
Recent advances in representation learning often rely on holistic embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging, where downstream tasks depend on anatomically interpretable features. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while enabling fine-grained control for targeted clinical applications. Experiments on CT and MRI datasets demonstrate OWT's power: it not only achieves strong performance on standard tasks like image reconstruction and segmentation, but also unlocks novel, high-impact clinical capabilities including organ-specific tumor identification, organ-level retrieval and semantic-level generation, without requiring any additional training. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and a new perspective on how representations can be leveraged.
CV | Mar 18, 2025
MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts
Runqi Meng, Sifan Song, Pengfei Jin et al.
Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
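Mixture-of-Experts layers route each token to a few experts through a learned gate. This sketch uses fixed top-k gating with softmax renormalization over the selected experts and linear experts; the "dynamic" selection in D-MoE may differ (for example, a variable number of experts), so treat the gating rule, expert form, and names as assumptions.

```python
import numpy as np

def topk_gate(x, W_gate, k=2):
    """Pick the k highest-scoring experts per token and renormalize their softmax weights."""
    logits = x @ W_gate                                  # (tokens, experts)
    topk = np.argsort(-logits, axis=-1)[:, :k]
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    return topk, w / w.sum(axis=-1, keepdims=True)

def moe_forward(x, W_gate, experts, k=2):
    """Weighted sum of the selected (linear) experts' outputs for each token."""
    idx, w = topk_gate(x, W_gate, k)
    out = np.zeros((x.shape[0], experts[0].shape[1]))
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += w[t, j] * (x[t] @ experts[idx[t, j]])
    return out
```

Only k experts run per token, so capacity can grow with the number of experts while per-token compute stays roughly constant.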
CL | May 11, 2023
GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark
Dongyang Li, Ruixue Ding, Qiang Zhang et al.
As geographic applications develop at a fast pace, automated and intelligent models are needed to handle the large volume of information. However, few researchers focus on geographic natural language processing, and no benchmark has yet established a unified standard. In this work, we propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE. We collect data from openly released geographic resources and introduce six natural language understanding tasks: geographic textual similarity on recall, geographic textual similarity on rerank, geographic elements tagging, geographic composition analysis, geographic where what cut, and geographic entity alignment. We also provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
CL | Nov 26, 2021
Simple Contrastive Representation Adversarial Learning for NLP Tasks
Deshui Miao, Jiaqi Zhang, Wenbo Xie et al.
Self-supervised learning approaches such as contrastive learning have attracted great attention in natural language processing. Contrastive learning uses pairs of augmented training data to build a classification task that trains an encoder with strong representation ability. However, constructing learning pairs for contrastive learning is much harder in NLP tasks. Previous works apply word-level changes to form pairs, but because natural language is discrete and sparse, small transformations can notably change the meaning of a sentence. In this paper, adversarial training is performed in the embedding space to generate challenging adversarial examples that serve as learning pairs. Contrastive learning improves the generalization ability of adversarial training because the contrastive loss makes the sample distribution more uniform; at the same time, adversarial training enhances the robustness of contrastive learning. We propose two novel frameworks, supervised contrastive adversarial learning (SCAL) and unsupervised SCAL (USCAL), which use adversarial training to yield learning pairs for contrastive learning. In supervised tasks, the label-based loss is exploited to generate adversarial examples, whereas unsupervised tasks use the contrastive loss. To validate the effectiveness of the proposed frameworks, we apply them to Transformer-based models for natural language understanding, sentence semantic textual similarity, and adversarial learning tasks. Experimental results on GLUE benchmark tasks show that our fine-tuned supervised method outperforms BERT-base by over 1.75%. We also evaluate our unsupervised method on semantic textual similarity (STS) tasks, where it achieves 77.29% with BERT-base. Our approach also achieves state-of-the-art robustness on multiple adversarial NLI datasets.
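The two ingredients, an embedding-space adversarial step and a contrastive loss over (clean, adversarial) pairs, can be sketched in NumPy. The perturbation follows the fast gradient method (an L2-normalized step along the loss gradient), and the loss is a standard InfoNCE; here the gradient is passed in rather than computed from a model, and `eps` and `tau` are illustrative values, not the paper's.

```python
import numpy as np

def fgm_perturb(emb, grad, eps=1.0):
    """Fast gradient method: move each embedding row by L2 distance eps along its loss gradient."""
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    return emb + eps * grad / np.maximum(norm, 1e-12)

def info_nce(z1, z2, tau=0.1):
    """InfoNCE: row i of z1 and row i of z2 are positives; all other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=-1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In the supervised variant described above, `grad` would come from the label-based loss; in the unsupervised one, from the contrastive loss itself.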
IV | Nov 26, 2020
Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19
Aoxiao Zhong, Xiang Li, Dufan Wu et al.
In recent years, deep learning-based image analysis methods have been widely applied in computer-aided detection, diagnosis and prognosis, and have shown their value during the public health crisis of the novel coronavirus disease 2019 (COVID-19) pandemic. Chest radiography (CXR) has played a crucial role in COVID-19 patient triage, diagnosis and monitoring, particularly in the United States. Because CXR signals are mixed and nonspecific, a CXR image retrieval model that provides both similar images and associated clinical information can be more clinically meaningful than a direct image diagnostic model. In this work we develop a novel CXR image retrieval model based on deep metric learning. Unlike traditional diagnostic models, which aim to learn a direct mapping from images to labels, the proposed model aims to learn an optimized embedding space of images, where images with the same labels and similar contents are pulled together. It uses a multi-similarity loss with a hard-mining sampling strategy and an attention mechanism to learn the optimized embedding space, and provides images similar to the query image. The model is trained and validated on an international multi-site COVID-19 dataset collected from 3 different sources. Experimental results on COVID-19 image retrieval and diagnosis tasks show that the proposed model can serve as a robust solution for CXR analysis and patient management for COVID-19. The model's transferability is also tested on a different clinical decision support task, where the pre-trained model is applied to extract image features from a new dataset without any further training. These results demonstrate that our deep metric learning-based image retrieval model is highly efficient in CXR retrieval, diagnosis and prognosis, and thus has great clinical value for the treatment and management of COVID-19 patients.
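The multi-similarity loss with hard mining named above can be sketched in NumPy on cosine similarities. The hyperparameters (`alpha=2`, `beta=50`, `lam=0.5`, mining margin `0.1`) are commonly used defaults and are assumptions here, as is skipping anchors left with no mined pairs; the attention mechanism is not shown.

```python
import numpy as np

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5, margin=0.1):
    """Multi-similarity loss over a batch of embeddings with hard pair mining."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = emb @ emb.T                                   # cosine similarities
    labels = np.asarray(labels)
    total, count = 0.0, 0
    for i in range(len(labels)):
        pos_mask = labels == labels[i]
        pos_mask[i] = False                           # exclude the anchor itself
        pos, neg = S[i, pos_mask], S[i, labels != labels[i]]
        if pos.size == 0 or neg.size == 0:
            continue
        # Mining: keep negatives closer than the hardest positive (plus margin), and vice versa
        neg_kept = neg[neg + margin > pos.min()]
        pos_kept = pos[pos - margin < neg.max()]
        if neg_kept.size == 0 or pos_kept.size == 0:
            continue
        total += (np.log1p(np.sum(np.exp(-alpha * (pos_kept - lam)))) / alpha
                  + np.log1p(np.sum(np.exp(beta * (neg_kept - lam)))) / beta)
        count += 1
    return total / max(count, 1)
```

Once same-label images are already much closer than different-label ones, mining discards every pair and the loss is zero, so training effort concentrates on confusable cases.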
MED-PH | Sep 13, 2020
Clinically Translatable Direct Patlak Reconstruction from Dynamic PET with Motion Correction Using Convolutional Neural Network
Nuobei Xie, Kuang Gong, Ning Guo et al.
The Patlak model is widely used in 18F-FDG dynamic positron emission tomography (PET) imaging, where the estimated parametric images reveal important biochemical and physiological information. Because it models noise better and extracts more information from the raw sinogram, direct Patlak reconstruction has gained popularity over the indirect approach, which uses reconstructed dynamic PET images alone. However, the raw dynamic PET data that direct Patlak methods require are rarely stored in clinics and are difficult to obtain. In addition, direct reconstruction is time-consuming because of the bottleneck of multiple-frame reconstruction. These factors impede the clinical adoption of direct Patlak reconstruction. In this work, we propose a data-driven framework that maps dynamic PET images to high-quality, motion-corrected direct Patlak images through a convolutional neural network. To handle patient motion during the long dynamic PET scan, we combined motion correction with the backward/forward projection in direct reconstruction to better fit the statistical model. Results based on fifteen clinical 18F-FDG dynamic brain PET datasets demonstrate the superiority of the proposed framework over Gaussian, nonlocal mean and BM4D denoising in terms of image bias and contrast-to-noise ratio.
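For reference, the indirect Patlak fit that the direct approach improves on is a per-voxel linear regression: after equilibrium time t*, C_T(t) = Ki·∫₀ᵗ C_p(τ)dτ + Vb·C_p(t). This NumPy sketch fits one time-activity curve by least squares with a trapezoidal integral; it is the textbook indirect method, not the paper's CNN-based direct reconstruction.

```python
import numpy as np

def cumulative_trapezoid(y, t):
    """Running integral of y over t, starting at 0 for t[0]."""
    return np.concatenate([[0.0], np.cumsum(np.diff(t) * (y[1:] + y[:-1]) / 2.0)])

def patlak_fit(t, cp, ct, t_star):
    """Least-squares fit of C_T(t) = Ki * int_0^t C_p + Vb * C_p(t) over frames with t >= t_star.
    t: frame times starting at 0; cp: plasma input function; ct: tissue time-activity curve."""
    integral = cumulative_trapezoid(cp, t)
    late = t >= t_star
    A = np.stack([integral[late], cp[late]], axis=1)
    (ki, vb), *_ = np.linalg.lstsq(A, ct[late], rcond=None)
    return ki, vb
```

Applied voxel by voxel to noisy reconstructed frames, this regression is what produces the noisy indirect Ki images that motivate direct reconstruction.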
MED-PH | Dec 16, 2019
Penalized-likelihood PET Image Reconstruction Using 3D Structural Convolutional Sparse Coding
Nuobei Xie, Kuang Gong, Ning Guo et al.
Positron emission tomography (PET) is widely used for clinical diagnosis. Because PET suffers from low resolution and high noise, numerous efforts have tried to incorporate anatomical priors into PET image reconstruction, especially with the development of hybrid PET/CT and PET/MRI systems. In this work, we propose a novel 3D structural convolutional sparse coding (CSC) concept for penalized-likelihood PET image reconstruction, named 3D PET-CSC. The proposed 3D PET-CSC takes advantage of the convolutional operation and incorporates anatomical priors without the need for registration or supervised training. Because 3D PET-CSC codes the whole 3D PET image instead of patches, it alleviates the staircase artifacts common in traditional patch-based sparse coding methods. Moreover, we developed residual-image and ordered-subset mechanisms to further reduce the computational cost and accelerate the convergence of the proposed 3D PET-CSC method. Experiments based on computer simulations and clinical datasets demonstrate the superiority of 3D PET-CSC over other reference methods.
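Convolutional sparse coding represents a whole signal as a sum of filters convolved with sparse feature maps, which is why it avoids patch-boundary staircase artifacts. This 1D sketch solves the standard CSC problem with ISTA and circular (FFT) convolution; it is a generic illustration, not the paper's 3D structural formulation, which additionally uses anatomical priors inside a penalized-likelihood PET objective.

```python
import numpy as np

def soft_threshold(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def csc_ista(x, filters, lam=0.05, n_iter=200):
    """Minimize 0.5*||x - sum_k d_k (*) z_k||^2 + lam*sum_k ||z_k||_1 with ISTA,
    where (*) is circular convolution computed via the FFT."""
    n = x.size
    D = np.stack([np.fft.fft(f, n) for f in filters])  # zero-padded filter spectra, (K, n)
    L = np.max(np.sum(np.abs(D) ** 2, axis=0))          # Lipschitz constant of the data-term gradient
    X = np.fft.fft(x)
    Z = np.zeros((len(filters), n))
    for _ in range(n_iter):
        resid = np.sum(D * np.fft.fft(Z, axis=1), axis=0) - X      # spectrum of (synthesis - x)
        grad = np.real(np.fft.ifft(np.conj(D) * resid, axis=1))    # adjoint = circular correlation
        Z = soft_threshold(Z - grad / L, lam / L)
    recon = np.real(np.fft.ifft(np.sum(D * np.fft.fft(Z, axis=1), axis=0)))
    return Z, recon
```

Because the filters act on the whole signal rather than on overlapping patches, no patch averaging is needed when the sparse maps are turned back into an image.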
IV | Oct 7, 2019
Multi-label Detection and Classification of Red Blood Cells in Microscopic Images
Wei Qiu, Jiaming Guo, Xiang Li et al.
Cell detection and cell type classification from biomedical images play an important role in high-throughput imaging and various clinical applications. While classification of single-cell samples can be performed with standard computer vision and machine learning methods, analysis of multi-label samples (regions containing congregating cells) is more challenging, as separating individual cells can be difficult (e.g., touching cells) or even impossible (e.g., overlapping cells). Because multi-instance images are common in analyzing Red Blood Cells (RBC) for Sickle Cell Disease (SCD) diagnosis, we develop and implement a multi-instance cell detection and classification framework to address this challenge. The framework first trains a region proposal model based on the Region-based Convolutional Network (RCNN) to obtain bounding boxes of regions potentially containing single or multiple cells from input microscopic images, which are extracted as image patches. High-level image features are then calculated from the image patches through a pre-trained Convolutional Neural Network (CNN) with a ResNet-50 structure. Using these image features as inputs, six networks are then trained to make multi-label predictions of whether a given patch contains cells belonging to a specific cell type. Because the six networks are trained with image patches consisting of both individual cells and touching/overlapping cells, they can effectively recognize cell types present in multi-instance image samples. Finally, for SCD testing, we train another machine learning classifier to predict whether a given image patch contains an abnormal cell type based on the outputs of the six networks. Testing results show that the proposed framework achieves good performance in automatic cell detection and classification.
LG | Oct 1, 2019
Predicting Alzheimer's Disease by Hierarchical Graph Convolution from Positron Emission Tomography Imaging
Jiaming Guo, Wei Qiu, Xiang Li et al.
Imaging-based early diagnosis of Alzheimer's Disease (AD) has become an effective approach, especially with nuclear medicine imaging techniques such as Positron Emission Tomography (PET). Various studies have found that PET images can be better modeled as signals (e.g., uptake of florbetapir) defined on a network (non-Euclidean) structure, which is governed by its underlying graph patterns of pathological progression and metabolic connectivity. To apply deep learning to PET image analysis effectively and overcome its restriction to the Euclidean grid, we develop a solution for 3D PET image representation and analysis under a generalized, graph-based CNN architecture (PETNet), which analyzes PET signals defined on a group-wise inferred graph structure. Computations in PETNet are defined in the non-Euclidean graph (network) domain: it performs feature extraction by convolution operations on spectral-filtered signals on the graph and pooling operations based on hierarchical graph clustering. The effectiveness of PETNet is evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, where it shows improved performance over both deep learning and other machine learning-based methods.
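Spectral graph convolution is commonly implemented with Chebyshev polynomials of the rescaled normalized Laplacian, which avoids an explicit eigendecomposition at every layer. This NumPy sketch shows that filter on a signal defined on graph nodes; whether PETNet uses exactly this parameterization is an assumption, and the group-wise graph inference and hierarchical pooling are not shown.

```python
import numpy as np

def scaled_laplacian(A):
    """L~ = 2L/lambda_max - I, where L = I - D^{-1/2} A D^{-1/2} is the normalized Laplacian."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(len(A))

def cheb_conv(x, L_tilde, theta):
    """y = sum_k theta[k] * T_k(L~) x via T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2}."""
    t_prev, t_cur = x, L_tilde @ x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_cur
    for k in range(2, len(theta)):
        t_prev, t_cur = t_cur, 2 * L_tilde @ t_cur - t_prev
        y = y + theta[k] * t_cur
    return y
```

A filter of order K mixes information only from nodes within K hops, so the learned filters are spatially localized on the brain-region graph.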
ML | Oct 1, 2018
Network Modeling and Pathway Inference from Incomplete Data ("PathInf")
Xiang Li, Qitian Chen, Xing Wang et al.
In this work, we developed a method for network inference from incomplete data ("PathInf"), as massive, non-uniformly distributed missing values are a common challenge in practical problems. PathInf is a two-stage inference model. In the first stage, it applies a data summarization model based on maximum likelihood to deal with the massive distributed missing values by transforming the observation-wise items in the data into a state matrix. In the second stage, the transition pattern (i.e., pathway) among variables is inferred as a graph inference problem solved by a greedy algorithm with constraints. The proposed method was validated and compared with a state-of-the-art Bayesian network method on simulation data, showing consistently superior performance. By applying PathInf to lymph vascular metastasis data, we obtained the holistic pathways of lymph node metastasis, with novel discoveries on jumping metastasis among nodes that are physically apart. The discovery indicates the possible presence of sentinel node groups in the lung lymph nodes, which had been speculated but never found. The pathway map can also improve the current dissection examination protocol for better individualized treatment planning, higher diagnostic accuracy, and reduced patient trauma.
CV | Jul 4, 2018
Learning Personalized Representation for Inverse Problems in Medical Imaging Using Deep Neural Network
Kuang Gong, Kyungsang Kim, Jianan Cui et al.
Recently, deep neural networks have been widely and successfully applied in computer vision tasks and have attracted growing interest in medical imaging. One barrier to applying deep neural networks in medical imaging is the need for large amounts of prior training pairs, which is not always feasible in clinical practice. In this work we propose a personalized representation learning framework that needs no prior training pairs, only the patient's own prior images. The representation is expressed using a deep neural network with the patient's prior images as network input. We then apply this novel image representation to inverse problems in medical imaging, in which the original inverse problem is formulated as a constrained optimization problem and solved using the alternating direction method of multipliers (ADMM) algorithm. Anatomically guided brain positron emission tomography (PET) image reconstruction and image denoising are used as examples to demonstrate the effectiveness of the proposed framework. Quantification results based on simulation and real datasets show that the proposed personalized representation framework outperforms other widely adopted methods.
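The constrained formulation, min over x of a data-fidelity term subject to x = g(θ) for the network representation, splits naturally under scaled ADMM. In this sketch a linear map `B` stands in for the deep network so each step has a closed form: the θ-step becomes a least-squares fit, where the real framework would train the network on the patient's prior images. The quadratic data term and all names are assumptions for illustration.

```python
import numpy as np

def admm_personalized(A, b, B, rho=1.0, n_iter=200):
    """Scaled ADMM for min_x 0.5*||A x - b||^2 s.t. x = B @ theta.
    The linear map B is a hypothetical stand-in for the network g(theta; prior images)."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)
    lhs = A.T @ A + rho * np.eye(n)
    for _ in range(n_iter):
        x = np.linalg.solve(lhs, A.T @ b + rho * (z - u))   # data-fidelity update
        theta, *_ = np.linalg.lstsq(B, x + u, rcond=None)    # "network fitting" update
        z = B @ theta
        u = u + x - z                                        # scaled dual update
    return x, theta
```

The split keeps the imaging physics (`A`) and the learned representation in separate subproblems, so the network step never needs to know the forward model.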
CVOct 31, 2017
Medical Image Segmentation Based on Multi-Modal Convolutional Neural Network: Study on Image Fusion Schemes
Zhe Guo, Xiang Li, Heng Huang et al.
Image analysis using more than one modality (i.e., multi-modal analysis) has been increasingly applied in biomedical imaging. One challenge in multi-modal analysis is that there are multiple schemes for fusing information from different modalities; these schemes are application-dependent and lack a unified framework to guide their design. In this work, we first propose a conceptual architecture for image fusion schemes in supervised biomedical image analysis: fusing at the feature level, fusing at the classifier level, and fusing at the decision-making level. Further, motivated by the recent success of deep learning in natural image analysis, we implement the three image fusion schemes using Convolutional Neural Networks (CNNs) with varied structures and combine them into a single framework. The proposed image segmentation framework can analyze multi-modality images using the different fusion schemes simultaneously. The framework is applied to detect the presence of soft tissue sarcoma from combined Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET) images. The results show that while all the fusion schemes outperform the single-modality schemes, fusing at the feature level generally achieves the best performance in terms of both accuracy and computational cost, but it also suffers from decreased robustness in the presence of large errors in any of the image modalities.
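The three fusion levels differ only in where the modalities are merged. The sketch below shows that structure with plain Python callables standing in for the CNN components; it reflects one common reading of the three levels, and the function names, the toy `mean` model, and majority voting as the decision rule are all assumptions rather than the paper's implementation.

```python
from typing import Callable, List, Sequence

Features = List[float]
Model = Callable[[Features], float]


def fuse_feature_level(modalities: Sequence[Features], classifier: Model) -> float:
    """Feature level: join all modality inputs first, then run a single classifier."""
    joint = [v for m in modalities for v in m]
    return classifier(joint)


def fuse_classifier_level(
    modalities: Sequence[Features],
    encoders: Sequence[Callable[[Features], Features]],
    head: Model,
) -> float:
    """Classifier level: encode each modality separately, merge the codes in a shared head."""
    codes = [v for m, enc in zip(modalities, encoders) for v in enc(m)]
    return head(codes)


def fuse_decision_level(
    modalities: Sequence[Features],
    classifiers: Sequence[Model],
    threshold: float = 0.5,
) -> int:
    """Decision level: classify each modality on its own, then majority-vote the labels."""
    votes = [int(clf(m) >= threshold) for m, clf in zip(modalities, classifiers)]
    return int(2 * sum(votes) > len(votes))


def mean(xs: Features) -> float:
    """Toy 'model' used only to exercise the fusion functions."""
    return sum(xs) / len(xs)
```

Majority voting is only one possible decision rule; averaging the per-modality probabilities is another common choice.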
CVJul 19, 2017
Self-paced Convolutional Neural Network for Computer Aided Detection in Medical Imaging Analysis
Xiang Li, Aoxiao Zhong, Ming Lin et al.
Tissue characterization has long been an important component of Computer Aided Diagnosis (CAD) systems for automatic lesion detection and further clinical planning. Motivated by the superior performance of deep learning methods on various computer vision problems, a growing body of work applies deep learning to medical image analysis. However, developing a robust and reliable deep learning model for computer-aided diagnosis remains highly challenging because of the combination of high heterogeneity in medical images and the relative lack of training samples. Specifically, annotating and labeling medical images is much more expensive and time-consuming than in other applications and often involves manual labor from multiple domain experts. In this work, we propose a multi-stage, self-paced learning framework that uses a convolutional neural network (CNN) to classify Computed Tomography (CT) image patches. The key contribution of this approach is that we augment the training set by refining unlabeled instances with a self-paced learning CNN. Implementing the framework on high-performance computing servers, including the NVIDIA DGX-1 machine, we obtained experimental results showing that the self-paced boosted network consistently outperformed the original network even with very scarce manual labels. The performance gain indicates that applications with limited training samples, such as medical image analysis, can benefit from the proposed framework.
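The idea of refining unlabeled instances in a self-paced way can be sketched as a multi-stage pseudo-labeling loop: at each stage the model is retrained on the current training set, and only unlabeled samples it finds "easy" (here, a large margin between the two nearest class centroids) are added with their predicted labels, while the threshold is lowered so harder samples enter later. This is a generic sketch under stated assumptions: a 1-D nearest-centroid classifier stands in for the CNN, and the margin criterion and threshold schedule are hypothetical, not the authors' pipeline.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature, label)


def centroids(samples: Sequence[Sample]) -> Dict[int, float]:
    """'Train' the stand-in model: mean feature value per class."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents: Dict[int, float], x: float) -> Tuple[int, float]:
    """Nearest-centroid label plus a margin (distance gap) used as an easiness score."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    return best, abs(x - cents[ranked[1]]) - abs(x - cents[best])


def self_paced_pseudo_label(
    labeled: Sequence[Sample],
    unlabeled: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[List[Sample], List[float]]:
    """Hypothetical multi-stage loop; thresholds should decrease (easy samples first)."""
    train = list(labeled)
    pool = list(unlabeled)
    for thr in thresholds:
        cents = centroids(train)  # retrain on the current (augmented) training set
        accepted: List[Sample] = []
        remaining: List[float] = []
        for x in pool:
            label, margin = predict(cents, x)
            if margin >= thr:
                accepted.append((x, label))  # confident enough: add as pseudo-labeled
            else:
                remaining.append(x)          # too hard for this stage: retry later
        train.extend(accepted)
        pool = remaining
    return train, pool
```

Samples that never clear the final threshold stay unlabeled, which limits how much label noise the augmentation can introduce.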