AI CL CV
Dec 3, 2024

ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao

arXiv:2412.02368v111.67 citations | h-index: 12 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical gap in accelerating scientific progress by benchmarking LLMs for scientific image generation, though it is incremental as it focuses on evaluation rather than new model development.

The authors introduce the ScImage benchmark to evaluate how well multimodal large language models (LLMs) generate scientific images from text. The benchmark assesses spatial, numeric, and attribute comprehension. GPT-4o performed decently on simpler prompts, but all models struggled with complex ones.

Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images--a critical application for accelerating scientific progress--remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion, using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.
