CV | AI | CL | LG | Jan 5, 2025

Efficient Architectures for High Resolution Vision-Language Models

arXiv:2501.02584v2 | 20 citations | h-index: 1 | COLING
Originality: Incremental advance
AI Analysis

This work addresses limitations of vision-language models in applications that require detailed image analysis, though it appears incremental because it builds on existing architectures.

The paper tackles the problem of fine detail recognition in high-resolution images for vision-language models, introducing Pheye, which achieves high efficiency with fewer parameters while maintaining strong performance in fine-grained and scene-text tasks.

Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene text.

Code Implementations: 1 repo
