CV · AI · Jun 3, 2024

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

arXiv:2406.00977v2 · 7 citations
AI Analysis

This paper addresses the challenge of fine-grained image understanding in vision-language models with a simple solution that improves performance in both general and specialized domains. The contribution is incremental, however, since it builds on existing high-resolution and multi-crop techniques.

The paper tackles the difficulty vision-language models have in capturing fine-grained details from less prominent objects, charts, and embedded text. It proposes Dragonfly, which uses multi-resolution zoom-in encoding to extract features from image sub-crops. Among models in the 7-8B parameter range, Dragonfly ranks at or near the top across ten general-domain benchmarks, and its biomedical variant, Dragonfly-Med, sets new state-of-the-art results on several medical tasks, including 91.6% accuracy on SLAKE.

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
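The zoom-in and mean-pooling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy encoder below stands in for a fixed-resolution ViT, and the function names, the base resolution of 16 pixels, the scale factors (1, 2, 3), and the 2x2 pooling window are all assumptions chosen to keep the example small and runnable.

```python
import numpy as np

def resize_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def split_into_crops(image, crop):
    """Cut a square image whose side is a multiple of `crop` into crop x crop tiles."""
    n = image.shape[0] // crop
    return [image[i * crop:(i + 1) * crop, j * crop:(j + 1) * crop]
            for i in range(n) for j in range(n)]

def toy_encoder(crop_img, patch=4):
    """Hypothetical stand-in for a fixed-resolution ViT.

    Emits one token per patch x patch tile (the tile's mean colour),
    returned as an (s, s, C) grid of tokens.
    """
    s = crop_img.shape[0] // patch
    c = crop_img.shape[2]
    return crop_img.reshape(s, patch, s, patch, c).mean(axis=(1, 3))

def mean_pool_tokens(grid, k):
    """Average non-overlapping k x k windows of an (s, s, C) token grid."""
    s, _, c = grid.shape
    return grid.reshape(s // k, k, s // k, k, c).mean(axis=(1, 3))

def encode_multi_resolution(image, base=16, scales=(1, 2, 3), pool=2):
    """Encode an image at several zoom levels and concatenate the tokens.

    Each scale resizes the image to base * scale pixels per side and cuts it
    into base x base sub-crops, so every crop matches the encoder's fixed
    input size. Tokens from zoomed-in crops (scale > 1) are mean-pooled to
    limit the total token count (an assumed policy for this sketch).
    """
    all_tokens = []
    for scale in scales:
        resized = resize_nearest(image, base * scale)
        for crop_img in split_into_crops(resized, base):
            grid = toy_encoder(crop_img)
            if scale > 1:
                grid = mean_pool_tokens(grid, pool)
            all_tokens.append(grid.reshape(-1, grid.shape[-1]))
    return np.concatenate(all_tokens, axis=0)

if __name__ == "__main__":
    image = np.random.rand(40, 40, 3)
    pooled = encode_multi_resolution(image)
    unpooled = encode_multi_resolution(image, pool=1)
    print(pooled.shape, unpooled.shape)
```

With these illustrative settings the three scales yield 1, 4, and 9 sub-crops. Pooling the zoomed-in crops reduces the sequence from 224 tokens to 68, which is the kind of token-count saving the abstract attributes to mean-pooling aggregation.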

Code Implementations: 1 repo