Video CLIP Model for Multi-View Echocardiography Interpretation
This work addresses the need for more accurate automated diagnosis of cardiac conditions by exploiting cardiac motion and multiple views. It appears incremental, however, because it builds on existing vision-language models for medical data.
The authors address automated echocardiographic interpretation by developing a video-language model that processes full video sequences from multiple views. The model was trained on 60,747 video-report pairs, and the authors evaluated the gains in retrieval performance from video input and multi-view support.
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models. Code and model weights are available at https://github.com/UTcardiology/video-echo-clip
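As a rough illustration of the retrieval evaluation described above, the sketch below fuses per-view video embeddings into one study-level embedding and scores study-to-report retrieval with Recall@K. It is a minimal sketch under stated assumptions, not the authors' implementation: mean pooling over views, the embedding dimension, and the function names `fuse_views` and `recall_at_k` are all hypothetical, and random vectors stand in for the outputs of the video and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_views(view_embeddings):
    """Fuse per-view embeddings into one embedding per study.

    view_embeddings: array of shape (n_studies, n_views, dim).
    Mean pooling is an assumption made for illustration only; the
    source does not state how views are combined.
    """
    per_view = l2_normalize(view_embeddings)
    return l2_normalize(per_view.mean(axis=1))


def recall_at_k(study_emb, report_emb, k):
    """Fraction of studies whose paired report ranks in the top k.

    Row i of study_emb is assumed to be paired with row i of report_emb.
    """
    sims = l2_normalize(study_emb) @ l2_normalize(report_emb).T
    true_scores = np.diag(sims)[:, None]
    # Rank = number of reports scoring strictly higher than the paired one.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())


# Synthetic stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_studies, n_views, dim = 8, 5, 16
report_emb = rng.normal(size=(n_studies, dim))
# Each view embedding is the paired report embedding plus noise.
noise = 0.1 * rng.normal(size=(n_studies, n_views, dim))
view_emb = report_emb[:, None, :] + noise

fused = fuse_views(view_emb)
recall_1 = recall_at_k(fused, report_emb, k=1)
print(f"Recall@1 on synthetic data: {recall_1:.2f}")
```

Comparing `recall_at_k` on a single view's embeddings against the fused embeddings is one simple way to quantify the benefit of multi-view support, which is the kind of comparison the evaluation above describes.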