CV AI | Jun 3, 2024

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

arXiv:2406.00977v2 | 14.77 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of fine-grained image understanding in vision-language models. It offers a simple solution that improves performance in both general and specialized domains, but the advance is incremental because it builds on existing high-resolution and multi-crop techniques.

The paper tackles the difficulty vision-language models have in capturing fine-grained details from less prominent objects, charts, and embedded text. It proposes Dragonfly, which uses multi-resolution zoom-in encoding to extract features from image sub-crops. Among models in the 7-8B parameter range, Dragonfly ranks at or near the top across ten general-domain benchmarks, and its biomedical variant, Dragonfly-Med, reports state-of-the-art results on several medical tasks, including 91.6% accuracy on SLAKE.

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
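The two ideas the abstract combines, extracting features from many image sub-crops and reducing the resulting token count by mean-pooling, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 224-pixel crop size, the 14-pixel patch size, the pooling factor of 2, and the `toy_encode` stand-in for a ViT are all assumptions chosen for illustration, and the input is assumed to be already resized to the zoomed-in resolution.

```python
import numpy as np


def split_into_crops(image, crop_size):
    """Split an (H, W, C) image into non-overlapping square sub-crops."""
    h, w, c = image.shape
    assert h % crop_size == 0 and w % crop_size == 0, "image must tile evenly"
    rows, cols = h // crop_size, w // crop_size
    crops = image.reshape(rows, crop_size, cols, crop_size, c)
    crops = crops.transpose(0, 2, 1, 3, 4)  # (rows, cols, crop, crop, C)
    return crops.reshape(rows * cols, crop_size, crop_size, c)


def toy_encode(crop, patch=14):
    """Hypothetical stand-in for a ViT: one token per patch (mean pixel value)."""
    size, _, c = crop.shape
    grid = size // patch
    patches = crop.reshape(grid, patch, grid, patch, c)
    return patches.mean(axis=(1, 3)).reshape(grid * grid, c)


def mean_pool_tokens(tokens, grid, factor):
    """Average non-overlapping factor x factor blocks of a grid x grid token map."""
    dim = tokens.shape[-1]
    out = grid // factor
    blocks = tokens.reshape(out, factor, out, factor, dim)
    return blocks.mean(axis=(1, 3)).reshape(out * out, dim)


# Illustrative sizes only: a 672x672 "zoomed-in" image tiled into 3x3 crops.
rng = np.random.default_rng(0)
image = rng.random((672, 672, 3))

crops = split_into_crops(image, crop_size=224)           # 9 crops
tokens_per_crop = [toy_encode(crop) for crop in crops]   # 256 tokens each
pooled = [mean_pool_tokens(t, grid=16, factor=2) for t in tokens_per_crop]
visual_tokens = np.concatenate(pooled, axis=0)           # shape (576, 3)
```

In this toy setting, pooling cuts the 9 x 256 = 2304 crop tokens to 576, which shows why an aggregation step matters as the number of sub-crops grows. The pooling granularity used in the paper may differ from the factor assumed here.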
