CV · Jul 31, 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song, Wenhao Chai, Guanhong Wang et al.
Integrating video foundation models with large language models makes it possible to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
LG · Feb 2, 2023
Double Equivariance for Inductive Link Prediction for Both New Nodes and New Relation Types
Jincheng Zhou, Yucheng Zhang, Jianfei Gao et al.
The task of fully inductive link prediction in knowledge graphs has gained significant attention, with various graph neural networks being proposed to address it. This task presents greater challenges than traditional inductive link prediction tasks with only new nodes, as models must be capable of zero-shot generalization to both unseen nodes and unseen relation types in the inference graph. Despite the development of novel models, a unifying theoretical understanding of their success remains elusive, and the limitations of these methods are not well-studied. In this work, we introduce the concept of double permutation-equivariant representations and demonstrate its necessity for effective performance in this task. We show that many existing models, despite their diverse architectural designs, conform to this framework. However, we also identify inherent limitations in double permutation-equivariant representations, which restrict these models' ability to learn effectively on datasets with varying characteristics. Our findings suggest that while double equivariance is necessary for meta-learning across knowledge graphs from different domains, it is not sufficient. There remains a fundamental gap between double permutation-equivariant models and the concept of foundation models designed to learn patterns across all domains.
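As a rough illustration of what double permutation equivariance means, the sketch below scores each triple using only counts, never the node or relation ids themselves, and then checks that relabeling both nodes and relations simply relabels the scores. The scoring function, the toy triples, and all names here are assumptions made for illustration; they are not the representations studied in the paper.

```python
from collections import Counter


def structural_scores(triples):
    """Score each (head, relation, tail) triple from counts only (toy, hypothetical scorer)."""
    head_degree = Counter(h for h, _, _ in triples)
    tail_degree = Counter(t for _, _, t in triples)
    relation_freq = Counter(r for _, r, _ in triples)
    return {
        (h, r, t): (head_degree[h], relation_freq[r], tail_degree[t])
        for h, r, t in triples
    }


def relabel(triples, node_map, rel_map):
    """Apply a node permutation and a relation permutation to every triple."""
    return [(node_map[h], rel_map[r], node_map[t]) for h, r, t in triples]


# Toy knowledge graph and two permutations (all made up for this example).
triples = [("a", "likes", "b"), ("a", "knows", "c"), ("b", "knows", "c")]
node_map = {"a": "c", "b": "a", "c": "b"}
rel_map = {"likes": "knows", "knows": "likes"}

scores = structural_scores(triples)
permuted_scores = structural_scores(relabel(triples, node_map, rel_map))

# Double equivariance: scoring the relabeled graph gives the relabeled scores.
is_double_equivariant = all(
    permuted_scores[(node_map[h], rel_map[r], node_map[t])] == score
    for (h, r, t), score in scores.items()
)
print(is_double_equivariant)
```

A scorer that looked up a learned embedding per relation name would fail this check, which is one way to see why such models cannot transfer to unseen relation types.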
CV · Apr 13, 2023
PALF: Pre-Annotation and Camera-LiDAR Late Fusion for the Easy Annotation of Point Clouds
Yucheng Zhang, Masaki Fukuda, Yasunori Ishii et al.
3D object detection has become indispensable in the field of autonomous driving. To date, gratifying breakthroughs have been recorded in 3D object detection research, attributed to deep learning. However, deep learning algorithms are data-driven and require large amounts of annotated point cloud data for training and evaluation. Unlike 2D image labels, point cloud annotations are difficult to produce because of the sparsity, irregularity, and low resolution of the data; they require more manual work, and annotation efficiency is much lower than for 2D images. Therefore, we propose an annotation algorithm for point cloud data that combines pre-annotation with camera-LiDAR late fusion to make annotation easy and accurate. The contributions of this study are as follows. We propose (1) a pre-annotation algorithm that employs 3D object detection and auto fitting for the easy annotation of point clouds, (2) a camera-LiDAR late fusion algorithm using 2D and 3D results for easy error checking, which helps annotators identify missing objects, and (3) a point cloud annotation evaluation pipeline to evaluate our experiments. The experimental results show that the proposed algorithm improves the annotating speed by 6.5 times and the annotation quality in terms of the 3D Intersection over Union and precision by 8.2 points and 5.6 points, respectively; additionally, the miss rate is reduced by 31.9 points.
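One way to picture the late-fusion error check is as a comparison between camera detections and LiDAR detections projected into the image plane: a camera box with no overlapping projected box is a candidate missed object. The sketch below assumes the projection has already been done and uses a hypothetical IoU threshold; it is not the authors' algorithm.

```python
def iou_2d(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def flag_possible_misses(camera_boxes, projected_lidar_boxes, iou_threshold=0.5):
    """Return camera boxes that no projected LiDAR box overlaps well enough.

    iou_threshold is an assumed value chosen for illustration.
    """
    return [
        cam
        for cam in camera_boxes
        if all(iou_2d(cam, lidar) < iou_threshold for lidar in projected_lidar_boxes)
    ]


# Made-up detections: the second camera box has no LiDAR counterpart.
camera_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
projected_lidar_boxes = [(1, 1, 10, 10)]
possible_misses = flag_possible_misses(camera_boxes, projected_lidar_boxes)
print(possible_misses)
```

Flagged boxes would be shown to the annotator for review rather than added automatically, since the 2D detector can also produce false positives.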
CV · Mar 31
TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
Jingbin You, Zehao Li, Hao Jiang et al.
3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
CV · May 23, 2024 · Code
TopoLogic: An Interpretable Pipeline for Lane Topology Reasoning on Driving Scenes
Yanping Fu, Wenbin Liao, Xinyuan Liu et al.
As an emerging task that integrates perception and reasoning, topology reasoning in autonomous driving scenes has recently garnered widespread attention. However, existing work often emphasizes "perception over reasoning": it typically boosts reasoning performance by enhancing the perception of lanes and directly adopts an MLP to learn lane topology from lane queries. This paradigm overlooks the geometric features intrinsic to the lanes themselves and is prone to being influenced by inherent endpoint shifts in lane detection. To tackle this issue, we propose an interpretable method for lane topology reasoning based on lane geometric distance and lane query similarity, named TopoLogic. This method mitigates the impact of endpoint shifts in geometric space, and introduces explicit similarity calculation in semantic space as a complement. By integrating results from both spaces, our method provides more comprehensive information for lane topology. Ultimately, our approach significantly outperforms the existing state-of-the-art methods on the mainstream benchmark OpenLane-V2 (23.9 vs. 10.9 in TOP$_{ll}$ and 44.1 vs. 39.8 in OLS on subset_A). Additionally, our proposed geometric distance topology reasoning method can be incorporated into well-trained models without re-training, significantly boosting the performance of lane topology reasoning. The code is released at https://github.com/Franpin/TopoLogic.
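The idea of combining a geometric-distance score with a query-similarity score can be sketched as follows. The exponential distance mapping, the cosine similarity, the equal weighting, and every function name are assumptions for illustration only; the released repository should be consulted for the actual formulation.

```python
import math


def geometric_score(lane_a, lane_b, sigma=1.0):
    """Map the gap between lane_a's end and lane_b's start to (0, 1].

    The exponential decay and sigma are hypothetical choices; a small gap
    still yields a high score, which tolerates minor endpoint shifts.
    """
    gap = math.dist(lane_a[-1], lane_b[0])
    return math.exp(-gap / sigma)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topology_score(lane_a, lane_b, query_a, query_b, weight=0.5, sigma=1.0):
    """Blend geometric and semantic evidence that lane_a connects to lane_b."""
    geo = geometric_score(lane_a, lane_b, sigma)
    sem = 0.5 * (cosine_similarity(query_a, query_b) + 1.0)  # rescale to [0, 1]
    return weight * geo + (1.0 - weight) * sem


# Made-up lanes as 2D polylines: lane_b starts 0.2 m past lane_a's end.
lane_a = [(0.0, 0.0), (10.0, 0.0)]
lane_b = [(10.2, 0.0), (20.0, 0.0)]
lane_c = [(10.0, 5.0), (20.0, 5.0)]
print(geometric_score(lane_a, lane_b), geometric_score(lane_a, lane_c))
```

Because the geometric term needs only predicted lane points, a score of this kind can be computed on the output of an already trained detector, which matches the abstract's remark about use without re-training.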
LG · Feb 26, 2025 · Code
TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs
Yucheng Zhang, Beatrice Bevilacqua, Mikhail Galkin et al.
Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at https://github.com/yuchengz99/TRIX.
CV · Sep 11, 2025 · Code
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu, Dongting Li, Yucheng Zhang et al.
While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
CV · May 23, 2025 · Code
TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving
Yanping Fu, Xinyuan Liu, Tianyu Li et al.
Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries, and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.
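The inference-time refinement can be pictured as snapping each lane's first and last point to the nearest detected endpoint when one lies close enough. This is a minimal sketch under that assumption; the distance threshold and function name are hypothetical, and the paper's matching algorithm may differ in detail.

```python
import math


def refine_endpoints(lane, detected_points, max_distance=1.0):
    """Snap a lane's start and end to the nearest detected endpoint within max_distance.

    lane is a polyline of (x, y) tuples; max_distance is an assumed threshold.
    """

    def snap(point):
        if not detected_points:
            return point
        nearest = min(detected_points, key=lambda q: math.dist(point, q))
        return nearest if math.dist(point, nearest) <= max_distance else point

    refined = list(lane)
    refined[0] = snap(lane[0])
    refined[-1] = snap(lane[-1])
    return refined


# Made-up example: two lanes that should meet at (10, 0) but were predicted slightly off.
lane_a = [(0.0, 0.0), (5.0, 0.0), (9.7, 0.2)]
lane_b = [(10.3, -0.1), (15.0, 0.0), (20.0, 0.0)]
detected_points = [(10.0, 0.0)]

refined_a = refine_endpoints(lane_a, detected_points)
refined_b = refine_endpoints(lane_b, detected_points)
print(refined_a[-1], refined_b[0])
```

After refinement the two lanes share an exact endpoint, so a connectivity test based on endpoint distance no longer depends on the original prediction error.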
CV · Dec 5, 2024 · Code
Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
Hao Zhu, Yan Zhu, Jiayu Xiao et al.
Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. Second, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive the clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the bright promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.
CV · Mar 30
HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang, Yucheng Zhang, Xiyan Xu et al.
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
CV · Jun 12, 2025 · Code
Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu, Hang Xu, Yike Ma et al.
Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, yet progress is hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
QM · Jul 10, 2019 · Code
Improving Prognostic Performance in Resectable Pancreatic Ductal Adenocarcinoma using Radiomics and Deep Learning Features Fusion in CT Images
Yucheng Zhang, Edrise M. Lobo-Mueller, Paul Karanicolas et al.
As an analytic pipeline for quantitative imaging feature extraction and analysis, radiomics has grown rapidly in the past few years. Recent studies in radiomics aim to investigate the relationship between tumor imaging features and clinical outcomes. Open source radiomics feature banks enable the extraction and analysis of thousands of predefined features. On the other hand, recent advances in deep learning have shown significant potential in the quantitative medical imaging field, raising the research question of whether predefined radiomics features have predictive information in addition to deep learning features. In this study, we propose a feature fusion method and investigate whether a combined feature bank of deep learning and predefined radiomics features can improve prognostic performance. CT images from resectable Pancreatic Ductal Adenocarcinoma (PDAC) patients were used to compare the prognostic performance of common feature reduction and fusion methods and the proposed risk-score based feature fusion method for overall survival. It was shown that the proposed feature fusion method significantly improves prognostic performance for overall survival in resectable PDAC cohorts, elevating the area under the ROC curve by 51% compared to predefined radiomics features alone, by 16% compared to deep learning features alone, and by 32% compared to existing feature fusion and reduction methods for a combination of deep learning and predefined radiomics features.
CR · Oct 30, 2024
HijackRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models
Yucheng Zhang, Qinfeng Li, Tianyu Du et al.
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge, making them adaptable and cost-effective for various applications. However, the growing reliance on these systems also introduces potential security risks. In this work, we reveal a novel vulnerability, the retrieval prompt hijack attack (HijackRAG), which enables attackers to manipulate the retrieval mechanisms of RAG systems by injecting malicious texts into the knowledge database. When the RAG system encounters target questions, it generates the attacker's pre-determined answers instead of the correct ones, undermining the integrity and trustworthiness of the system. We formalize HijackRAG as an optimization problem and propose both black-box and white-box attack strategies tailored to different levels of the attacker's knowledge. Extensive experiments on multiple benchmark datasets show that HijackRAG consistently achieves high attack success rates, outperforming existing baseline attacks. Furthermore, we demonstrate that the attack is transferable across different retriever models, underscoring the widespread risk it poses to RAG systems. Lastly, our exploration of various defense mechanisms reveals that they are insufficient to counter HijackRAG, emphasizing the urgent need for more robust security measures to protect RAG systems in real-world deployments.
CV · Jun 11, 2025
HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene
Jianing Chen, Zehao Li, Yujun Cai et al.
Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.
IV · Dec 30, 2023
Deep Radon Prior: A Fully Unsupervised Framework for Sparse-View CT Reconstruction
Shuo Xu, Yucheng Zhang, Gang Chen et al.
Although sparse-view computed tomography (CT) has significantly reduced radiation dose, it also introduces severe artifacts which degrade the image quality. In recent years, deep learning-based methods for inverse problems have made remarkable progress and have become increasingly popular in CT reconstruction. However, most of these methods suffer from several limitations: dependence on high-quality training data, weak interpretability, etc. In this study, we propose a fully unsupervised framework called Deep Radon Prior (DRP), inspired by Deep Image Prior (DIP), to address the aforementioned limitations. DRP introduces a neural network as an implicit prior into the iterative method, thereby realizing cross-domain gradient feedback. During the reconstruction process, the neural network is progressively optimized in multiple stages to narrow the solution space in the Radon domain for the under-constrained imaging protocol, and the convergence of the proposed method is discussed in this work. Compared with the popular pre-trained method, the proposed framework requires no dataset and exhibits superior interpretability and generalization ability. The experimental results demonstrate that the proposed method can generate detailed images while effectively suppressing image artifacts. Meanwhile, DRP achieves comparable or better performance than the supervised methods.
CV · Nov 24, 2025
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Kehua Chen, Tianlu Mao, Zhuxin Ma et al.
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation, which serves as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy and rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.
MM · Oct 21, 2025
DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation
Tong Liu, Zhiwei Fan, Guanyan Peng et al.
Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch-time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch-time estimation method. Additionally, a Deep Reinforcement Learning (DRL) enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large-scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.
CV · Oct 3, 2025
From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
Jianing Chen, Zehao Li, Yujun Cai et al.
Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
LG · Apr 7, 2025
KunPeng: A Global Ocean Environmental Model
Yi Zhao, Jiaqi Li, Haitao Xia et al.
Inspired by the similarity of the atmosphere-ocean physical coupling mechanism, this study migrates meteorological large-model techniques to the ocean domain, constructing the KunPeng global ocean environmental prediction model. To handle the discontinuous characteristics of marine space, we propose a terrain-adaptive mask constraint mechanism that effectively mitigates training divergence caused by abrupt gradients at land-sea boundaries. To fully integrate far-, medium-, and close-range marine features, a longitude-cyclic deformable convolution network (LC-DCN) is employed to enhance the dynamic receptive field, achieving refined modeling of multi-scale oceanic characteristics. A Deformable Convolution-enhanced Multi-Step Prediction module (DC-MTP) is employed to strengthen temporal dependency feature extraction capabilities. Experimental results demonstrate that this model achieves an average ACC of 0.80 in 15-day global predictions at 0.25$^\circ$ resolution, outperforming comparative models by 0.01-0.08. The average mean squared error (MSE) is 0.41 (representing a 5%-31% reduction) and the average mean absolute error (MAE) is 0.44 (0.6%-21% reduction) compared to other models. Significant improvements are particularly observed in sea surface parameter prediction, deep-sea region characterization, and current velocity field forecasting. Through a horizontal comparison of the applicability of operators at different scales in the marine domain, this study reveals that local operators significantly outperform global operators under slow-varying oceanic processes, demonstrating the effectiveness of dynamic feature pyramid representations in predicting marine physical parameters.
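The "longitude-cyclic" part of LC-DCN reflects the fact that a global grid wraps around in longitude: the easternmost column is a neighbor of the westernmost one. A minimal sketch of that idea is circular padding along the longitude axis before applying a local filter. The moving-average filter below stands in for a convolution and is an assumption for illustration; the deformable part of the network is not modeled here.

```python
def pad_longitude_cyclic(grid, pad):
    """Wrap each latitude row around in longitude by `pad` cells on both sides.

    grid is a list of latitude rows, each a list of values over longitude.
    """
    width = len(grid[0])
    if not 1 <= pad <= width:
        raise ValueError("pad must be between 1 and the number of longitude cells")
    return [row[-pad:] + row + row[:pad] for row in grid]


def smooth_along_longitude(grid, pad=1):
    """Moving average over 2*pad+1 longitude cells with wrap-around at the seam."""
    window = 2 * pad + 1
    padded = pad_longitude_cyclic(grid, pad)
    return [
        [sum(row[i:i + window]) / window for i in range(len(row) - 2 * pad)]
        for row in padded
    ]


# One made-up latitude row with four longitude cells.
grid = [[1.0, 2.0, 3.0, 4.0]]
print(pad_longitude_cyclic(grid, 1))
print(smooth_along_longitude(grid, 1))
```

With zero padding instead, the first and last columns would be averaged against artificial zeros, creating a spurious seam at the date line.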
AI · Dec 5, 2024
Using SlowFast Networks for Near-Miss Incident Analysis in Dashcam Videos
Yucheng Zhang, Koichi Emura, Eiji Watanabe
This paper classifies near-miss traffic videos using the SlowFast deep neural network that mimics the characteristics of the slow and fast visual information processed by two different streams from the M (Magnocellular) and P (Parvocellular) cells of the human brain. The approach significantly improves the accuracy of the traffic near-miss video analysis and presents insights into human visual perception in traffic scenarios. Moreover, it contributes to traffic safety enhancements and provides novel perspectives on the potential cognitive errors in traffic accidents.
IV · Apr 1, 2021
Explaining COVID-19 and Thoracic Pathology Model Predictions by Identifying Informative Input Features
Ashkan Khakzar, Yang Zhang, Wejdene Mansour et al.
Neural networks have demonstrated remarkable performance in classification and regression tasks on chest X-rays. In order to establish trust in the clinical routine, the networks' prediction mechanism needs to be interpretable. One principal approach to interpretation is feature attribution. Feature attribution methods identify the importance of input features for the output prediction. Building on the Information Bottleneck Attribution (IBA) method, for each prediction we identify the chest X-ray regions that have high mutual information with the network's output. The original IBA identifies input regions that have sufficient predictive information. We propose Inverse IBA to identify all informative regions. Thus all predictive cues for pathologies are highlighted on the X-rays, a desirable property for chest X-ray diagnosis. Moreover, we propose Regression IBA for explaining regression models. Using Regression IBA we observe that a model trained on cumulative severity score labels implicitly learns the severity of different X-ray regions. Finally, we propose Multi-layer IBA to generate higher resolution and more detailed attribution/saliency maps. We evaluate our methods using both human-centric (ground-truth-based) interpretability metrics, and human-independent feature importance metrics on NIH Chest X-ray8 and BrixIA datasets. The code is publicly available.
QM · Jun 25, 2019
CNN-based Survival Model for Pancreatic Ductal Adenocarcinoma in Medical Imaging
Yucheng Zhang, Edrise M. Lobo-Mueller, Paul Karanicolas et al.
The Cox proportional hazard (CPH) model is commonly used in clinical research for survival analysis. In quantitative medical imaging (radiomics) studies, CPH plays an important role in feature reduction and modeling. However, the underlying linear assumption of the CPH model limits its prognostic performance. In addition, the multicollinearity of radiomic features and the multiple testing problem further impede the CPH model's performance. In this work, using transfer learning, a convolutional neural network (CNN) based survival model was built and tested on preoperative CT images of resectable Pancreatic Ductal Adenocarcinoma (PDAC) patients. The proposed CNN-based survival model outperformed the traditional CPH-based radiomics approach in terms of concordance index by 22%, providing a better fit for patients' survival patterns. This approach offers a better fit for survival patterns based on CT images and overcomes the limitations of conventional survival models.
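For readers unfamiliar with the concordance index used in this comparison, the sketch below computes a pairwise version in the style of Harrell's C: among comparable patient pairs, it is the fraction where the patient who died earlier was assigned the higher risk. The toy data are made up, and this is a generic textbook formulation rather than the authors' evaluation code.

```python
def concordance_index(times, events, risks):
    """Pairwise concordance index.

    times:  observed survival or censoring times
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's observed or censored time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up cohort: the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect_risks = [0.9, 0.7, 0.4, 0.1]
reversed_risks = [0.1, 0.4, 0.7, 0.9]
print(concordance_index(times, events, perfect_risks))
print(concordance_index(times, events, reversed_risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for a relative improvement in this metric.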
QM · May 23, 2019
Prognostic Value of Transfer Learning Based Features in Resectable Pancreatic Ductal Adenocarcinoma
Yucheng Zhang, Edrise M. Lobo-Mueller, Paul Karanicolas et al.
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive cancers with an extremely poor prognosis. Radiomics has shown prognostic ability in multiple types of cancer including PDAC. However, the prognostic value of traditional radiomics pipelines, which are based on hand-crafted radiomic features alone, is limited. Convolutional neural networks (CNNs) have been shown to outperform these feature-based models in computer vision tasks. However, training a CNN from scratch needs a large sample size, which is not feasible in most medical imaging studies. As an alternative solution, CNN-based transfer learning has shown potential for achieving reasonable performance using small datasets. In this work, we developed and validated a CNN-based transfer learning approach for prognostication of PDAC patients for overall survival using two independent resectable PDAC cohorts. The proposed deep transfer learning model for prognostication of PDAC achieved an area under the receiver operating characteristic curve of 0.74, which was significantly higher than that of the traditional radiomics model (0.56) as well as a CNN model trained from scratch (0.50). These results suggest that deep transfer learning may significantly improve prognosis performance using small datasets in medical imaging.
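To put the reported values in context, the area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so 0.50 is chance level. The sketch below computes it with that pairwise definition on made-up labels and scores; it is a generic illustration, not the study's evaluation pipeline.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for positive cases, 0 for negative cases
    scores: predicted scores (higher means more likely positive)
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: one positive case is ranked below one negative case.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

The quadratic pairwise loop is fine for small cohorts such as those typical in medical imaging studies; a rank-based formula would be preferable for large datasets.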