MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
This work addresses the problem of limited data for developing unified biomedical assistants, though its contribution is incremental because it builds on existing mixed-modal generative methods.
The authors tackle the lack of large-scale, diverse datasets for training biomedical AI assistants by creating MedMax, a 1.47-million-instance multimodal instruction-tuning dataset. Fine-tuning a mixed-modal foundation model on MedMax yields a 26% performance gain over Chameleon and an 18.3% gain over GPT-4o across 12 downstream biomedical visual question-answering tasks.
Recent advancements in mixed-modal generative models have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements: a 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Finally, we introduce a unified evaluation suite for biomedical tasks to guide the development of mixed-modal biomedical AI assistants. The data, model, and code are available at https://mint-medmax.github.io/.