Multimodal Medical Endoscopic Image Analysis via Progressive Disentangle-aware Contrastive Learning
This work addresses precise tumor segmentation for medical diagnosis and treatment planning, and represents an incremental improvement in multimodal medical image analysis.
The paper tackles accurate segmentation of laryngo-pharyngeal tumors by integrating paired 2D White Light Imaging (WLI) and Narrow Band Imaging (NBI) images, and reports outperforming state-of-the-art methods in experiments on multiple datasets.
Accurate segmentation of laryngo-pharyngeal tumors is crucial for precise diagnosis and effective treatment planning. However, traditional single-modality imaging methods often fail to capture the complex anatomical and pathological features of these tumors. In this study, we present a multi-modality representation learning framework based on an 'Align-Disentangle-Fusion' mechanism that integrates paired 2D White Light Imaging (WLI) and Narrow Band Imaging (NBI) images to enhance segmentation performance. A cornerstone of our approach is multi-scale distribution alignment, which mitigates modality discrepancies by aligning features across multiple transformer layers. Furthermore, we develop a progressive feature disentanglement strategy that combines a preliminary disentanglement step with disentangle-aware contrastive learning to separate modality-specific and shared features, enabling robust multimodal contrastive learning and efficient semantic fusion. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art approaches across diverse real clinical scenarios.
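
To make the alignment and contrastive steps more concrete, the following is a minimal sketch in plain Python, not the paper's implementation. The abstract does not specify the loss functions, so two common stand-ins are assumed here: a mean-matching discrepancy averaged over layers for multi-scale distribution alignment, and an InfoNCE-style loss that treats the shared features of a WLI/NBI pair as positives and the other pairs in the batch as negatives. All function names and the temperature value are hypothetical.

```python
import math


def mean_discrepancy(feats_a, feats_b):
    """Squared Euclidean distance between the batch means of two feature sets.

    Assumption: a simple mean-matching statistic stands in for whatever
    distribution distance the paper actually uses.
    """
    dim = len(feats_a[0])
    mean_a = [sum(f[d] for f in feats_a) / len(feats_a) for d in range(dim)]
    mean_b = [sum(f[d] for f in feats_b) / len(feats_b) for d in range(dim)]
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


def multi_scale_alignment_loss(layers_wli, layers_nbi):
    """Average the per-layer discrepancy over several (hypothetical) layers."""
    per_layer = [mean_discrepancy(a, b) for a, b in zip(layers_wli, layers_nbi)]
    return sum(per_layer) / len(per_layer)


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(shared_wli, shared_nbi, temperature=0.1):
    """InfoNCE-style loss over modality-shared features.

    The i-th WLI feature is the anchor, the i-th NBI feature is its positive,
    and every other NBI feature in the batch is a negative.
    """
    losses = []
    for i, anchor in enumerate(shared_wli):
        logits = [cosine(anchor, other) / temperature for other in shared_nbi]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy features: two layers, batch of two samples, two dimensions each.
    wli_layers = [[[0.0, 0.0], [2.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
    nbi_layers = [[[1.0, 1.0], [3.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]]]
    print("alignment loss:", multi_scale_alignment_loss(wli_layers, nbi_layers))
    print("contrastive loss:", info_nce(wli_layers[1], nbi_layers[1]))
```

In this sketch the alignment term shrinks as the per-layer feature statistics of the two modalities move closer, while the contrastive term is small only when each WLI shared feature is more similar to its own NBI partner than to the other samples in the batch. A real system would compute both on learned transformer features with an autodiff framework.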