CV · May 25

RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

arXiv:2605.25956 · 70.5 · Has Code
Predicted impact: top 42% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For clinical NLP and healthcare AI, this work addresses the need for verifiable evidence grounding in automated cancer referral triage, though it is an incremental extension of the existing RAPTOR system.

RAPTOR+ uses Vision-Language Models for end-to-end understanding of colorectal cancer referral forms. Fine-tuned Qwen3-VL-8B achieves 96.1% Reading Accuracy and 60.6% Strict Safety, compared with 92.6% Reading Accuracy and 1.2% Strict Safety for zero-shot Gemini 2.5 Flash. The authors take this gap as evidence that task-specific fine-tuning is essential for reliable and auditable clinical document processing.

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
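The abstract does not define its metrics, so the following is only a minimal sketch of how a grounding-aware evaluation could score extraction and evidence localisation together. Everything in it is an assumption made for illustration: the `FieldPrediction` record, the case-insensitive string match, the intersection-over-union (IoU) test with a 0.5 threshold, and the metric name `grounded_accuracy`, which stands in for the idea behind Strict Safety rather than reproducing the paper's definition.

```python
from dataclasses import dataclass

# Hypothetical box format: (x_min, y_min, x_max, y_max) in page pixels.
Box = tuple[float, float, float, float]


@dataclass
class FieldPrediction:
    """One extracted form field with its claimed evidence region (illustrative)."""
    predicted_value: str
    true_value: str
    predicted_box: Box
    true_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_aware_scores(
    fields: list[FieldPrediction], iou_threshold: float = 0.5
) -> dict[str, float]:
    """Score value correctness alone, and value correctness plus localisation.

    The threshold and matching rule are assumptions, not the paper's protocol.
    """
    if not fields:
        raise ValueError("at least one field is required")
    read_ok = 0
    grounded_ok = 0
    for f in fields:
        value_correct = (
            f.predicted_value.strip().lower() == f.true_value.strip().lower()
        )
        located = iou(f.predicted_box, f.true_box) >= iou_threshold
        read_ok += int(value_correct)
        # A field counts as grounded only if the value is right AND the
        # cited region overlaps the annotated evidence.
        grounded_ok += int(value_correct and located)
    n = len(fields)
    return {
        "reading_accuracy": read_ok / n,
        "grounded_accuracy": grounded_ok / n,
    }


if __name__ == "__main__":
    demo = [
        FieldPrediction("Yes", "yes", (0, 0, 10, 10), (0, 0, 10, 10)),
        FieldPrediction("No", "no", (0, 0, 10, 10), (20, 20, 30, 30)),
    ]
    print(grounding_aware_scores(demo))
```

In the demo, both values are read correctly but only one points at the right region, which mirrors the kind of gap the abstract reports for zero-shot models: a high reading score can coexist with a much lower score once the evidence location must also be correct.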
