AI | LG | SP | Jan 27, 2025

Smarter Together: Combining Large Language Models and Small Models for Physiological Signals Visual Inspection

arXiv:2501.16215v2 | 4 citations | h-index: 20 | J Healthc Informatics Res
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable and trustworthy AI-driven clinical decision support by combining large language models (LLMs) with small specialized models (SSMs). It is an incremental advance because it builds on existing methods such as Multiple Instance Learning (MIL) and conformal prediction.

The paper addresses the limited domain-specific precision and reasoning of LLMs when they visually inspect physiological signals, and it does so by pairing them with SSMs. The combination improves accuracy: for example, precision on confident samples in arrhythmia detection rises from 46.13% to 94.92%.

Large language models (LLMs) have shown promising capabilities in visually interpreting medical time-series data. However, their general-purpose design can limit domain-specific precision, and the proprietary nature of many models poses challenges for fine-tuning on specialized clinical datasets. Conversely, small specialized models (SSMs) offer strong performance on focused tasks but lack the broader reasoning needed for complex medical decision-making. To address these complementary limitations, we introduce ConMIL (Conformalized Multiple Instance Learning), a novel decision-support framework that synergizes three key components: (1) a new Multiple Instance Learning (MIL) mechanism, QTrans-Pooling, designed for per-class interpretability in identifying clinically relevant physiological signal segments; (2) conformal prediction, integrated with MIL to generate calibrated, set-valued outputs with statistical reliability guarantees; and (3) a structured approach for these interpretable and uncertainty-quantified SSM outputs to enhance the visual inspection capabilities of LLMs. Our experiments on arrhythmia detection and sleep stage classification demonstrate that ConMIL can enhance the accuracy of LLMs such as ChatGPT4.0, Qwen2-VL-7B, and MiMo-VL-7B-RL. For example, ConMIL-supported Qwen2-VL-7B and MiMo-VL-7B-RL both achieve 94.92% and 96.82% precision on confident samples and (70.61% and 78.02%)/(78.10% and 71.98%) on uncertain samples for the two tasks, compared to 46.13% and 13.16% using the LLM alone. These results suggest that integrating task-specific models with LLMs may offer a promising pathway toward more interpretable and trustworthy AI-driven clinical decision support.
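
To make the "calibrated, set-valued outputs" in component (2) concrete, the following is a minimal sketch of generic split conformal prediction for a classifier. It is not the paper's implementation: the function names (`calibrate_threshold`, `prediction_set`, `is_confident`), the nonconformity score (one minus the probability of the true class), and the reading of a "confident" sample as one whose prediction set contains exactly one class are all assumptions made for illustration.

```python
import math

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration (illustrative, not the paper's code).

    cal_probs:  list of per-class probability lists from a held-out set
    cal_labels: list of true class indices for the same samples
    alpha:      target miscoverage rate (coverage goal is 1 - alpha)
    """
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank (1-based) of the calibration quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: accept every class.
        return math.inf
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]

def is_confident(pred_set):
    """Assumed convention: a singleton set counts as a confident sample."""
    return len(pred_set) == 1

# Toy usage with made-up probabilities (hypothetical data, two classes).
q_hat = calibrate_threshold(
    cal_probs=[[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]],
    cal_labels=[0, 0, 1],
    alpha=0.5,
)
candidate_classes = prediction_set([0.8, 0.2], q_hat)
print(candidate_classes, is_confident(candidate_classes))
```

Under this sketch, a downstream LLM would be given the candidate classes in the set rather than a single hard label, so that it only has to choose among options the small model could not rule out. A prediction set can also be empty when no class clears the threshold, which this sketch treats as not confident.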

Code Implementations: 1 repo