CV AI LGSep 18, 2025

ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification

Alvaro Lopez Pellicer, Andre Mariucci, Plamen Angelov, Marwan Bukhari, Jemma G. Kerns

arXiv:2509.14830v28.42 citationsh-index: 62025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality Incremental advance

AI Analysis

This work addresses the need for explainable AI in medical applications, specifically for clinicians diagnosing bone health conditions like Osteopenia and Osteoporosis, though it appears incremental by combining existing modalities with a prototype-based approach.

The paper tackles the problem of bone health classification by proposing ProtoMedX, a multi-modal model that uses DEXA scans and patient records to achieve state-of-the-art performance, with accuracies of 87.58% for vision-only and 89.8% for multi-modal tasks on a dataset of 4,160 patients.

Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The applications of AI in this field are ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal (multimodal) model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.

View on arXiv PDF

Similar