Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning
This work addresses automated DICOM series classification for medical image analysis, which is incremental as it builds on existing multimodal methods by explicitly modeling metadata sparsity and cross-modal interactions.
The paper tackled the challenge of DICOM series classification by proposing a multimodal framework that integrates image content and metadata, handling issues like sparsity and variability, and demonstrated consistent performance improvements over baselines on Duke Liver MRI and multi-institutional datasets.
Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image only, metadata-only and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.