CV AI May 29

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

arXiv:2605.30716 38.0 h-index: 5

AI Analysis

This work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research for researchers and clinicians with limited computational resources.

This paper introduces a token-efficient vision-language model for generating pathology reports from whole-slide images (WSIs), addressing challenges like gigapixel resolution and complex case-level reasoning. The model achieves high ROUGE-L/METEOR/BLEU-4 scores while reducing average sequence length by up to 64 times and enabling practical training on half an NVIDIA H100 GPU.

Generating clinically useful case-level pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, this enables practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
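The sequence-length reduction can be sanity-checked with simple patch-count arithmetic. The sketch below is illustrative only and is not the authors' code: the slide dimensions are hypothetical, background filtering is ignored, and the function names `patch_count` and `case_sequence_length` are made up for this example. It also assumes one visual token per patch and one marker token per slide. Under these assumptions, moving from $20\times$ to $5\times$ at a fixed $512 \times 512$ patch size gives a $16\times$ reduction, and a $64\times$ reduction appears if the $20\times$ baseline uses $256 \times 256$ patches. The abstract does not state which baseline patch size underlies its "up to $64\times$" figure, so that pairing is a guess.

```python
import math


def patch_count(width_20x, height_20x, magnification, patch_size,
                base_magnification=20):
    """Count non-overlapping patches tiling a slide at a given magnification.

    width_20x and height_20x are the slide's pixel dimensions at the base
    magnification. Background (non-tissue) filtering is ignored here.
    """
    scale = magnification / base_magnification
    cols = math.ceil(width_20x * scale / patch_size)
    rows = math.ceil(height_20x * scale / patch_size)
    return cols * rows


def case_sequence_length(slide_patch_counts, markers_per_slide=1):
    """Visual sequence length for a case: one token per patch plus an
    assumed single marker token per slide."""
    return sum(n + markers_per_slide for n in slide_patch_counts)


# Hypothetical slide: 81,920 x 40,960 pixels at 20x.
W, H = 81920, 40960

tokens_5x_512 = patch_count(W, H, magnification=5, patch_size=512)
tokens_20x_512 = patch_count(W, H, magnification=20, patch_size=512)
tokens_20x_256 = patch_count(W, H, magnification=20, patch_size=256)

# Reduction factors relative to the 5x / 512-pixel setting.
ratio_vs_20x_512 = tokens_20x_512 / tokens_5x_512
ratio_vs_20x_256 = tokens_20x_256 / tokens_5x_512

# A two-slide case at 5x / 512 pixels, including marker tokens.
two_slide_case = case_sequence_length([tokens_5x_512, tokens_5x_512])

print(tokens_5x_512, ratio_vs_20x_512, ratio_vs_20x_256, two_slide_case)
```

Because patch count scales with the square of the magnification ratio and with the inverse square of the patch-size ratio, the two factors multiply: $(20/5)^2 \times (512/256)^2 = 16 \times 4 = 64$.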

