CVJul 27, 2024Code
GP-VLS: A general-purpose vision language model for surgerySamuel Schmidgall, Joseph Cho, Cyril Zakka et al.
Surgery requires comprehensive medical knowledge, visual assessment skills, and procedural expertise. While recent surgical AI models have focused on solving task-specific problems, there is a need for general-purpose systems that can understand surgical scenes and interact through natural language. This paper introduces GP-VLS, a general-purpose vision language model for surgery that integrates medical and surgical knowledge with visual scene understanding. For comprehensively evaluating general-purpose surgical models, we propose SurgiQual, which evaluates across medical and surgical knowledge benchmarks as well as surgical vision-language questions. To train GP-VLS, we develop six new datasets spanning medical knowledge, surgical textbooks, and vision-language pairs for tasks like phase recognition and tool identification. We show that GP-VLS significantly outperforms existing open- and closed-source models on surgical vision-language tasks, with 8-21% improvements in accuracy across SurgiQual benchmarks. GP-VLS also demonstrates strong performance on medical and surgical knowledge tests compared to open-source alternatives. Overall, GP-VLS provides an open-source foundation for developing AI assistants to support surgeons across a wide range of tasks and scenarios. The code and data for this work is publicly available at gpvls-surgery-vlm.github.io.
CVAug 26, 2024
SurGen: Text-Guided Diffusion Model for Surgical Video GenerationJoseph Cho, Samuel Schmidgall, Cyril Zakka et al.
Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
CVApr 21
A Generalist Model for Diverse Text-Guided Medical Image SynthesisJoseph Cho, Mrudang Mathur, Cyril Zakka et al.
Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Image generative models have found increasing use for medical applications, but are often task-specific, thus limiting their scalability. Moreover, existing models frequently rely on private datasets for training, which constrain their reproducibility. To address this, we introduce MediSyn: an open-access, generalist, text-guided latent diffusion model capable of generating synthetic images across 6 medical specialties and 10 imaging modalities, while being trained exclusively on publicly available data. Through extensive experimentation, we provide several key contributions. First, we demonstrate that training a generative model on visually diverse medical images does not degrade synthetic image quality. Second, we show that this generalist approach is substantially more computationally efficient than a coordinated suite of task-specific models. Third, we establish that a generalist model can produce realistic, text-aligned synthetic images across visually and medically distinct modalities, as validated by expert physicians. Fourth, we provide empirical evidence that these synthetic images are visually distinct from their corresponding real patient images, alleviating concerns about data memorization in image generative models. Finally, we demonstrate that a generalist model can produce synthetic images that improve classifier performance in data-limited settings across multiple medical specialties. Altogether, our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.
IVMar 25
A Generalizable Deep Learning System for Cardiac MRIRohan Shad, Cyril Zakka, Dhamanpreet Kaur et al.
Cardiac MRI allows for a comprehensive assessment of myocardial structure, function and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep-learning model is trained via self-supervised contrastive learning, in which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank and two additional publicly available external datasets. We explore emergent capabilities of our system and demonstrate remarkable performance across a range of tasks, including the problem of left-ventricular ejection fraction regression and the diagnosis of 39 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep-learning system is capable of not only contextualizing the staggering complexity of human cardiovascular disease but can be directed towards clinical problems of interest, yielding impressive, clinical-grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
AIMar 8, 2024
Sora as an AGI World Model? A Complete Survey on Text-to-Video GenerationJoseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng et al.
The evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora, has progressed at a breakneck speed over the past seven years. While often seen as a superficial expansion of the predecessor text-to-image generation model, text-to-video generation models are developed upon carefully engineered constituents. Here, we systematically discuss these elements consisting of but not limited to core building blocks (vision, language, and temporal) and supporting features from the perspective of their contributions to achieving a world model. We employ the PRISMA framework to curate 97 impactful research articles from renowned scientific databases primarily studying video synthesis using text conditions. Upon minute exploration of these manuscripts, we observe that text-to-video generation involves more intricate technologies beyond the plain extension of text-to-image generation. Our additional review into the shortcomings of Sora-generated videos pinpoints the call for more in-depth studies in various enabling aspects of video generation such as dataset, evaluation metric, efficient architecture, and human-controlled generation. Finally, we conclude that the study of the text-to-video generation may still be in its infancy, requiring contribution from the cross-discipline research community towards its advancement as the first step to realize artificial general intelligence (AGI).
AIApr 30, 2024
Almanac Copilot: Towards Autonomous Electronic Health Record NavigationCyril Zakka, Joseph Cho, Gracia Fahed et al.
Clinicians spend large amounts of time on clinical documentation, and inefficiencies impact quality of care and increase clinician burnout. Despite the promise of electronic medical records (EMR), the transition from paper-based records has been negatively associated with clinician wellness, in part due to poor user experience, increased burden of documentation, and alert fatigue. In this study, we present Almanac Copilot, an autonomous agent capable of assisting clinicians with EMR-specific tasks such as information retrieval and order placement. On EHR-QA, a synthetic evaluation dataset of 300 common EHR queries based on real patient data, Almanac Copilot obtains a successful task completion rate of 74% (n = 221 tasks) with a mean score of 2.45 over 3 (95% CI:2.34-2.56). By automating routine tasks and streamlining the documentation process, our findings highlight the significant potential of autonomous agents to mitigate the cognitive load imposed on clinicians by current EMR systems.
CVMay 16, 2024
MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image SynthesisJoseph Cho, Mrudang Mathur, Cyril Zakka et al.
Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Recently, image generative models have found increasing use for medical applications but are often designed for singular medical specialties and imaging modalities, thus limiting their broader utility. To address this, we introduce MediSyn: a text-guided, latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types. Through extensive experimentation, we first demonstrate that MediSyn quantitatively matches or surpasses the performance of specialist models. Second, we show that our synthetic images are realistic and exhibit strong alignment with their corresponding text prompts, as validated by a team of expert physicians. Third, we provide empirical evidence that our synthetic images are visually distinct from their corresponding real patient images. Finally, we demonstrate that in data-limited settings, classifiers trained solely on synthetic data or real data supplemented with synthetic data can outperform those trained solely on real data. Our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.
CVMay 12, 2023
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt EngineeringChaoning Zhang, Joseph Cho, Fachrina Dewi Puspitasari et al.
The Segment Anything Model (SAM), developed by Meta AI Research, represents a significant breakthrough in computer vision, offering a robust framework for image and video segmentation. This survey provides a comprehensive exploration of the SAM family, including SAM and SAM 2, highlighting their advancements in granularity and contextual understanding. Our study demonstrates SAM's versatility across a wide range of applications while identifying areas where improvements are needed, particularly in scenarios requiring high granularity and in the absence of explicit prompts. By mapping the evolution and capabilities of SAM models, we offer insights into their strengths and limitations and suggest future research directions, including domain-specific adaptations and enhanced memory and propagation mechanisms. We believe that this survey comprehensively covers the breadth of SAM's applications and challenges, setting the stage for ongoing advancements in segmentation technology.
CVMay 10, 2023
Generative AI meets 3D: A Survey on Text-to-3D in AIGC EraChenghao Li, Chaoning Zhang, Joseph Cho et al.
Generative AI has made significant progress in recent years, with text-guided content generation being the most practical as it facilitates interaction between human instructions and AI-generated content (AIGC). Thanks to advancements in text-to-image and 3D modeling technologies, like neural radiance field (NeRF), text-to-3D has emerged as a nascent yet highly active research field. Our work conducts a comprehensive survey on this topic and follows up on subsequent research progress in the overall field, aiming to help readers interested in this direction quickly catch up with its rapid development. First, we introduce 3D data representations, including both Structured and non-Structured data. Building on this pre-requisite, we introduce various core technologies to achieve satisfactory text-to-3D results. Additionally, we present mainstream baselines and research directions in recent text-to-3D technology, including fidelity, efficiency, consistency, controllability, diversity, and applicability. Furthermore, we summarize the usage of text-to-3D technology in various applications, including avatar generation, texture generation, scene generation and 3D editing. Finally, we discuss the agenda for the future development of text-to-3D.