CV | CL | Mar 26, 2025

Vision as LoRA

arXiv:2503.20680v1 | 31 citations | h-index: 14 | Has Code
Originality Highly original
AI Analysis

This work addresses the structural complexity and computational overhead of MLLMs, which matters to AI researchers and practitioners, though it is incremental in that it builds on existing LoRA and MLLM methods.

The paper tackles the problem of turning large language models (LLMs) into multimodal large language models (MLLMs). It introduces Vision as LoRA (VoRA), which integrates vision-specific LoRA layers directly into the LLM and so eliminates external vision modules. Given additional pre-training data, VoRA performs comparably to conventional encoder-based MLLMs.

We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
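Two ideas in the abstract can be made concrete with a small sketch: folding LoRA parameters into a frozen weight before inference, and an attention mask that is causal for text but bi-directional among image tokens. The code below is a minimal illustration in plain Python, not the authors' implementation. The function names (`merge_lora`, `hybrid_mask`), the `scale` parameter, the toy shapes, and the assumption that the sequence contains a single image are all hypothetical choices made for this example.

```python
# Hypothetical sketch; names, shapes, and scaling are assumptions, not VoRA's code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply(w, x):
    """Compute W @ x for a single input vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def merge_lora(w, a, b, scale=1.0):
    """Fold a low-rank update into a frozen weight: W' = W + scale * (B @ A)."""
    delta = matmul(b, a)  # (out, r) @ (r, in) -> (out, in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def hybrid_mask(is_image):
    """Return mask[i][j], True when position i may attend to position j.

    Causal everywhere, except that image tokens may attend to each other
    in both directions (assumes one image per sequence).
    """
    n = len(is_image)
    return [[j <= i or (is_image[i] and is_image[j]) for j in range(n)]
            for i in range(n)]

# Toy example: a 2x3 frozen weight with a rank-1 LoRA update.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # (r=1, in=3)
B = [[1.0], [2.0]]      # (out=2, r=1)
x = [1.0, 1.0, 1.0]

# Unmerged path: base output plus the scaled low-rank branch.
base = apply(W, x)
branch = apply(B, apply(A, x))
unmerged = [u + 0.5 * v for u, v in zip(base, branch)]

# Merged path: a single matrix, so no extra work at inference time.
merged = apply(merge_lora(W, A, B, scale=0.5), x)
print(unmerged, merged)

# One text token, two image tokens, one text token.
for row in hybrid_mask([False, True, True, False]):
    print(row)
```

Because the merged weight gives the same output as the base weight plus the low-rank branch, the LoRA layers add no separate module at inference time, which is the property the abstract relies on when it says the added parameters can be merged into the LLM.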

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
