On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications
This work addresses the challenge of robust clinical decision-making when not all data modalities are available, but the contribution is incremental because it builds on prior knowledge distillation methods.
The paper tackles the problem of data modalities that are missing at inference time in clinical deep learning. It proposes multimodal privileged knowledge distillation (MMPKD), which uses extra modalities during training to guide a unimodal vision model. The resulting attention maps were better at localizing regions of interest in chest radiographs and mammography, though the effect did not generalize across domains.
Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we use a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the zero-shot ability of the resulting attention maps to localize regions of interest (ROI) in input images. However, this effect does not generalize across domains, contrary to what prior research suggested.
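
The abstract does not state which distillation objective MMPKD uses, so the following is only a minimal sketch of a generic soft-label distillation loss, assuming the teacher (text or tabular) and the vision student both emit class logits for the same sample. The function names, the `alpha` weighting, and the `temperature` value are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Hypothetical combined loss: hard-label cross-entropy plus a soft-label term.

    alpha and temperature are assumed hyperparameters, not values from the paper.
    The teacher sees the privileged modality; the student sees only the image.
    """
    # Cross-entropy of the student against the ground-truth label.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label] + 1e-12)

    # KL divergence between softened teacher and student distributions,
    # scaled by temperature^2 as is common in soft-label distillation.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return (1 - alpha) * ce + alpha * kd


# Example: a student that agrees with the teacher incurs a lower loss
# than one that disagrees.
teacher = [2.0, 0.0]
loss_agree = distillation_loss([2.0, 0.0], teacher, label=0)
loss_disagree = distillation_loss([0.0, 2.0], teacher, label=0)
```

At inference time only the student would be evaluated, which is what allows the privileged modality to be absent; the teacher contributes solely through the soft-label term during training.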