CV AIMay 1, 2025

AdCare-VLM: Towards a Unified and Pre-aligned Latent Representation for Healthcare Video Understanding

Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu

arXiv:2505.00275v33.6h-index: 15Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses medication adherence issues for patients with chronic diseases like tuberculosis, though it is incremental as it builds on existing LLaVA-based models with domain-specific fine-tuning.

The paper tackles the problem of medication adherence monitoring in chronic diseases by proposing AdCare-VLM, a multimodal vision-language model that improves visual question answering on patient videos, achieving absolute improvements of 3.1% to 3.54% over existing models.

Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized LLaVA-based multimodal large vision language model (LVLM) by introducing a unified visual latent space with pre-alignment to facilitate visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

View on arXiv PDF Code

Similar