CVJan 27

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

arXiv:2601.19798v13 citationsh-index: 9
Originality Incremental advance
AI Analysis

This addresses the limitation of coarse-grained multimodal comprehension in VLMs, offering a more balanced approach for developing generalist visual agents, though it appears incremental as it builds on existing VLM architectures.

The paper tackled the problem of Vision-Language Models (VLMs) losing fine-grained visual information due to a text-dominant training bias, and introduced Youtu-VL with a unified autoregressive supervision paradigm that achieved competitive performance on multimodal and vision-centric tasks.

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from ``vision-as-input'' to ``vision-as-target.'' By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes